Abstract

We study the tradeoff between the statistical error and communication cost of distributed
statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean
estimation problem, each of the $m$ machines receives $n$ data points from a $d$-dimensional
Gaussian distribution with unknown mean $\theta$ which is promised to be $k$-sparse. The machines
communicate by message passing and aim to estimate the mean $\theta$. We provide a tight (up to
logarithmic factors) tradeoff between the estimation error and the number of bits communicated
between the machines. This directly leads to a lower bound for the distributed sparse linear
regression problem: to achieve the statistical minimax error, the total communication is at least
$\Omega(\min\{n, d\} m)$, where $n$ is the number of observations that each machine receives and
$d$ is the ambient dimension. These lower bounds improve upon [Sha14, SD15] by allowing a
multi-round iterative communication model. We also give the first optimal simultaneous protocol
in the dense case for mean estimation.

As our main technique, we prove a distributed data processing inequality, as a generalization
of usual data processing inequalities, which might be of independent interest and useful for other
problems.

1 Introduction


Rapid growth in the size of modern data sets has fueled a lot of interest in solving statistical
and machine learning tasks in a distributed environment using multiple machines. Communication
between the machines has emerged as an important resource and sometimes the main
bottleneck. A lot of recent work has been devoted to designing communication-efficient learning
algorithms [DAW12, ZDW13, ZX15, KVW14, LBKW14, SSZ14, LSLT15].

In this paper we consider statistical estimation problems in the distributed setting, which
can be formalized as follows. There is a family of distributions
$\mathcal{P} = \{\mu_\theta : \theta \in \Omega \subset \mathbb{R}^d\}$ that
is parameterized by $\theta \in \mathbb{R}^d$. Each of the $m$ machines is given $n$ i.i.d. samples
drawn from an unknown distribution $\mu_\theta \in \mathcal{P}$. The machines communicate with each
other by message passing, and do computation on their local samples and the messages that they
receive from others. Finally one of the machines needs to output an estimator $\hat\theta$ and the
statistical error is usually measured by the mean-squared loss
$\mathbb{E}[\|\hat\theta - \theta\|^2]$. We count the communication between the
machines in bits.
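To make the setting concrete, the following is a minimal simulation sketch of a naive baseline, not a protocol from this paper: every machine sends its local sample mean, quantized to a fixed number of bits per coordinate, and one machine averages the messages. The function names, the quantization range `clip`, and the bit budget `bits` are illustrative assumptions.

```python
import random

def simulate_naive_protocol(theta, m, n, sigma=1.0, bits=8, clip=4.0, seed=0):
    """Each machine sends its local sample mean, uniformly quantized to `bits`
    bits per coordinate on [-clip, clip]; the coordinator averages the messages.
    Returns (estimate, total_bits_communicated). Illustrative baseline only."""
    rng = random.Random(seed)
    d = len(theta)
    levels = 2 ** bits
    step = 2 * clip / (levels - 1)
    sums = [0.0] * d
    for _ in range(m):
        for i in range(d):
            # Local sample mean of n Gaussian samples for coordinate i.
            local_mean = sum(rng.gauss(theta[i], sigma) for _ in range(n)) / n
            clipped = max(-clip, min(clip, local_mean))
            code = round((clipped + clip) / step)   # integer in [0, levels - 1]
            sums[i] += code * step - clip           # value decoded by the coordinator
    estimate = [s / m for s in sums]
    return estimate, m * d * bits

def squared_error(est, theta):
    """Squared Euclidean distance between the estimate and the true mean."""
    return sum((a - b) ** 2 for a, b in zip(est, theta))
```

Note that the communication of this baseline is `m * d * bits`, which grows with the ambient dimension `d` even when `theta` has a single nonzero coordinate; whether this dependence can be avoided is the question studied below.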

This paper focuses on understanding the fundamental tradeoff between communication and
the statistical error for high-dimensional statistical estimation problems. Modern large datasets
are often equipped with a high-dimensional statistical model, while communication of
high-dimensional vectors could potentially be expensive. It has been shown by Duchi et al. [DJWZ14]
and Garg et al. [GMN14] that for the linear regression problem, the communication cost must
scale with the dimensionality for achieving optimal statistical minimax error - not surprisingly,
the machines have to communicate high-dimensional vectors in order to estimate
high-dimensional parameters.

These negative results naturally lead to the interest in high-dimensional estimation problems
with additional sparse structure on the parameter $\theta$. It has been well understood that the
statistical minimax error typically depends on the intrinsic dimension, that is, the sparsity of
the parameters, instead of the ambient dimension.¹ Thus it is natural to expect that the same
phenomenon also happens for communication.

However, this paper disproves this possibility in the interactive communication model by
proving that for the sparse Gaussian mean estimation problem (where one estimates the mean
of a Gaussian distribution which is promised to be sparse, see Section 2 for the formal
definition), in order to achieve the statistical minimax error, the communication must scale with
the ambient dimension. On the other end of the spectrum, if alternatively the communication
only scales with the sparsity, then the statistical error must scale with the ambient dimension
(see Theorem 4.5). Shamir [Sha14] establishes the same result for the 1-sparse case under a
non-iterative communication model.

Our lower bounds for the Gaussian mean estimation problem imply lower bounds for the
sparse linear regression problem (Corollary 4.8) via the reduction of [ZDJW13]: for a Gaussian
design matrix, to achieve the statistical minimax error, the communication cost per machine
needs to be $\Omega(\min\{n, d\})$, where $d$ is the ambient dimension and $n$ is the number of
observations that each machine receives. This lower bound matches the upper bound in [LSLT15]
when $n$ is larger than $d$. When $n$ is less than $d$, we note that it is not clear whether $O(n)$
or $O(d)$ should be the minimum communication cost per machine needed. In any case, our
contribution here is in proving a lower bound that does not depend on the sparsity. Compared
to previous work of Steinhardt and Duchi [SD15], which proves the same lower bounds for a
memory-bounded model, our results work for a stronger communication model where
multi-round iterative communication is allowed. Moreover, our techniques are possibly simpler and
potentially easier to adapt to related problems. For example, we show that the result of Woodruff
and Zhang [WZ12] on the information complexity of distributed gap majority can be reproduced
by our technique with a cleaner proof (see Theorem C.1).

¹ The dependency on the ambient dimension is typically logarithmic.


We complement our lower bounds for this problem in the dense case by providing a new
simultaneous protocol, improving the number of rounds of the previous communication-optimal
protocol from $O(\log m)$ to 1 (see Theorem 4.6). Our protocol is based on a certain combination
of many bits from a few Gaussian samples, together with roundings (to a single bit) of the
fractional parts of many Gaussian samples.

Our proof techniques are potentially useful for other questions along these lines. We first use
a modification of the direct-sum result of [GMN14], which is tailored towards sparse problems, to
reduce the estimation problem to a detection problem. Then we prove what we call a distributed
data processing inequality for bounding from below the cost of the detection problem. The latter
is the crux of our proofs. We elaborate more on it in the next subsection.

1.1 Distributed Data Processing Inequality 

We consider the following distributed detection problem. As we will show in Section 4 (by a
direct-sum theorem), it suffices to prove a tight lower bound in this setting, in order to prove a
lower bound on the communication cost for the sparse linear regression problem.

Distributed detection problem: We have a family of distributions $\mathcal{P}$ that consists of
only two distributions $\{\mu_0, \mu_1\}$, and the parameter space is $\Omega = \{0,1\}$. To
facilitate the use of tools from information theory, sometimes it is useful to introduce a prior
over the parameter space. Let $V \sim B_q$ be a Bernoulli random variable with probability $q$ of
being 1. Given $V = v \in \{0,1\}$, we draw i.i.d. samples $X_1, \dots, X_m$ from $\mu_v$ and the
$j$-th machine receives one sample $X_j$, for $j = 1, \dots, m$. We use $\Pi \in \{0,1\}^*$ to
denote the sequence of messages that are communicated by the machines. We will refer to $\Pi$ as a
“transcript”, and the distributed algorithm that the machines execute as a “protocol”.

The final goal of the machines is to output an estimator for the hidden parameter $v$ which is
as accurate as possible. We formalize the estimator as a (random) function
$\hat v : \{0,1\}^* \to \{0,1\}$ that takes the transcript $\Pi$ as input. We require that given
$V = v$, the estimator is correct with probability at least 3/4, that is,
$\min_{v \in \{0,1\}} \Pr[\hat v(\Pi) = v \mid V = v] \ge 3/4$. When $q = 1/2$, this
is essentially equivalent to the statement that the transcript $\Pi$ carries $\Omega(1)$
information about the random variable $V$. Therefore, the mutual information $I(V;\Pi)$ is also
used as a convenient measure for the quality of the protocol when $q = 1/2$.

Strong data processing inequality: The mutual information viewpoint of the accuracy
naturally leads us to the following approach for studying the simple case when $m = 1$ and
$q = 1/2$. When $m = 1$, we note that the parameter $V$, data $X$, and transcript $\Pi$ form a
simple Markov chain $V \to X \to \Pi$. The channel $V \to X$ is defined as $X \sim \mu_v$,
conditioned on $V = v$. The strong data processing inequality (SDPI) captures the relative ratio
between $I(V;\Pi)$ and $I(X;\Pi)$.

Definition 1 (Special case of SDPI). Let $V \sim B_{1/2}$ and the channel $V \to X$ be defined as
above. Then there exists a constant $\beta \le 1$ that depends on $\mu_0$ and $\mu_1$, such that
for any $\Pi$ that depends only on $X$ (that is, $V \to X \to \Pi$ forms a Markov chain), we have

$$I(V;\Pi) \le \beta \cdot I(X;\Pi). \tag{1}$$

An inequality of this type is typically referred to as a strong data processing inequality for
mutual information when $\beta < 1$.² Let $\beta(\mu_0,\mu_1)$ be the infimum over all possible
$\beta$ such that (1) is true, which we refer to as the SDPI constant.
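As a small numerical illustration of Definition 1, the sketch below takes a toy instance that is not from the paper: $\mu_0$ and $\mu_1$ are Bernoulli distributions on a single bit, and the transcript is a single bit produced from $X$ by a randomized rule. It computes $I(V;\Pi)$ and $I(X;\Pi)$ exactly and searches a grid of such rules for the largest ratio. Because only one-bit transcripts on a grid are searched, the result is only a lower estimate of the SDPI constant; all function names are hypothetical.

```python
import math
from itertools import product

def mutual_information(joint):
    """Mutual information (in bits) of a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def information_pair(p0, p1, a0, a1):
    """V ~ Bernoulli(1/2); X | V=v ~ Bernoulli(p_v); Pr[Pi=1 | X=x] = a_x.
    Returns (I(V;Pi), I(X;Pi)) for the Markov chain V -> X -> Pi."""
    px_given_v = {0: {0: 1 - p0, 1: p0}, 1: {0: 1 - p1, 1: p1}}
    ppi_given_x = {0: {0: 1 - a0, 1: a0}, 1: {0: 1 - a1, 1: a1}}
    joint_v_pi, joint_x_pi = {}, {}
    for v, x, t in product((0, 1), repeat=3):
        p = 0.5 * px_given_v[v][x] * ppi_given_x[x][t]
        joint_v_pi[(v, t)] = joint_v_pi.get((v, t), 0.0) + p
        joint_x_pi[(x, t)] = joint_x_pi.get((x, t), 0.0) + p
    return mutual_information(joint_v_pi), mutual_information(joint_x_pi)

def estimate_beta(p0, p1, grid=20):
    """Largest ratio I(V;Pi)/I(X;Pi) over a grid of one-bit transcripts.
    This only lower-bounds the SDPI constant (transcripts may be richer)."""
    best = 0.0
    for i, j in product(range(grid + 1), repeat=2):
        i_v, i_x = information_pair(p0, p1, i / grid, j / grid)
        if i_x > 1e-12:                      # skip transcripts independent of X
            best = max(best, i_v / i_x)
    return best
```

For example, `estimate_beta(0.4, 0.6)` stays well below 1, which is the sense in which the inequality is "strong" for distributions that are close to each other.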

Observe that the LHS of (1) measures how much information $\Pi$ carries about $V$, which is
closely related to the accuracy of the protocol. The RHS of (1) is a lower bound on the expected
length of $\Pi$, that is, the expected communication cost. Therefore the inequality relates two
quantities that we are interested in - the statistical quality of the protocol and the
communication cost of the protocol. Concretely, when $q = 1/2$, in order to recover $V$ from
$\Pi$, we need that $I(V;\Pi) \ge \Omega(1)$, and therefore inequality (1) gives that
$I(X;\Pi) \ge \Omega(\beta^{-1})$. Then it follows from Shannon’s source coding theory that the
expected length of $\Pi$ (denoted by $|\Pi|$) is bounded from below by
$\mathbb{E}[|\Pi|] \ge \Omega(\beta^{-1})$. We refer to [Rag14] for a thorough survey of SDPI.³

² Inequality (1) is always true for a Markov chain $V \to X \to \Pi$ with $\beta = 1$, and this is
called the data processing inequality.

In the multiple machine setting, Duchi et al. [DJWZ14] link the distributed detection problem
with SDPI by showing from scratch that for any $m$, when $q = 1/2$, if $\beta$ is such that
$(1 - \sqrt{\beta})\mu_1 \le \mu_0 \le (1 + \sqrt{\beta})\mu_1$, then

$$I(V;\Pi) \le \beta \cdot I(X_1 \dots X_m;\Pi).$$

This results in the bounds for the Gaussian mean estimation problem and the linear regression
problem. The main limitation of this inequality is that it requires the prior $B_q$ to be unbiased
(or close to unbiased). For our target application of high-dimensional problems with sparsity
structures, like sparse linear regression, in order to apply this inequality we need to put a very
biased prior $B_q$ on $V$. The proof technique of [DJWZ14] also seems hard to extend to this case
with a tight bound.⁴ Moreover, the relation between $\beta$, $\mu_0$ and $\mu_1$ may not be
necessary (or optimal), and indeed for the Gaussian mean estimation problem, the inequality is
only tight up to a logarithmic factor, while potentially in other situations the gap is even larger.

Our approach is essentially a prior-free multi-machine SDPI, which has the same SDPI
constant $\beta$ as is required for the single machine one. We prove that, as long as the SDPI (1)
for a single machine is true with parameter $\beta$, and $\mu_0 \le O(1)\mu_1$, then the following
prior-free multi-machine SDPI is true with the same constant $\beta$ (up to a constant factor).

Theorem 1.1 (Distributed SDPI). Suppose $\frac{1}{c} \cdot \mu_0 \le \mu_1 \le c \cdot \mu_0$ for
some constant $c \ge 1$, and let $\beta(\mu_0,\mu_1)$ be the SDPI constant defined in
Definition 1. Then in the distributed detection problem, we have the following distributed strong
data processing inequality,

$$h^2(\Pi|_{V=0}, \Pi|_{V=1}) \le K c\, \beta(\mu_0,\mu_1) \cdot \min\{I(X_1 \dots X_m;\Pi \mid V=0),\ I(X_1 \dots X_m;\Pi \mid V=1)\} \tag{2}$$

where $K$ is a universal constant, $h(\cdot,\cdot)$ is the Hellinger distance between two
distributions, and $\Pi|_{V=v}$ denotes the distribution of $\Pi$ conditioned on $V = v$.

Moreover, for any $\mu_0$ and $\mu_1$ which satisfy the condition of the theorem, there exists a
protocol that produces transcript $\Pi$ such that (2) is tight up to a constant factor.

As an immediate consequence, we obtain a lower bound on the communication cost for the
distributed detection problem.

Corollary 1.2. Suppose the protocol and estimator $(\Pi, \hat v)$ are such that for any
$v \in \{0,1\}$, given $V = v$, the estimator $\hat v$ (that takes $\Pi$ as input) can recover $v$
with probability 3/4. Then

$$\max_{v \in \{0,1\}} \mathbb{E}[|\Pi| \mid V = v] \ge \Omega(\beta^{-1}).$$

Our theorem suggests that to bound the communication cost of the multi-machine setting
from below, one could simply work in the single machine setting and obtain the right SDPI
constant $\beta$. Then, a lower bound of $\Omega(\beta^{-1})$ for the multi-machine setting
immediately follows.

In other words, the machines need to communicate a lot to fully exploit the $m$ data points
they receive (one on each machine), regardless of how complicated their multi-round
protocol is.

³ Also note that in information theory, SDPI is typically interpreted as characterizing how
information decays when passed through the reverse channel $X \to V$. That is, when the channel
$X \to V$ is lossy, then information about $\Pi$ will decay by a factor of $\beta$ after passing
$X$ through the channel. However, in this paper we take a different
interpretation that is more convenient for our applications.

⁴ We note, though, that it seems possible to extend the proof to the situation where there is only
one round of communication.

Remark 1. Note that our inequality differs from the typical data processing inequality on
both the left and right hand sides. First of all, the RHS of (2) is always less than or equal to
$I(X_1 \dots X_m; \Pi \mid V)$ for any prior $B_q$ on $V$. This allows us to have a tight bound on
the expected communication $\mathbb{E}[|\Pi|]$ for the case when $q$ is very small.

Second, the squared Hellinger distance (see Definition 4) on the LHS of (2) is not very far
away from $I(\Pi;V)$, especially for the situation that we consider. It can be viewed as an
alternative (if not more convenient) measure of the quality of the protocol than mutual
information - the further $\Pi|_{V=0}$ is from $\Pi|_{V=1}$, the easier it is to infer $V$ from
$\Pi$. When a good estimator is possible (which is the case in which we are going to apply the
bound), the Hellinger distance, the total variation distance between $\Pi|_{V=0}$ and
$\Pi|_{V=1}$, and $I(V;\Pi)$ are all $\Omega(1)$. Therefore in this case,
the Hellinger distance does not make the bound weaker.

Finally, suppose we impose a uniform prior for $V$. Then the squared Hellinger distance
is within a constant factor of $I(V;\Pi)$ (see Lemma 10; the lower bound side was proved
by [BYJKS04]),

$$2h^2(\Pi|_{V=0}, \Pi|_{V=1}) \ge I(V;\Pi) \ge h^2(\Pi|_{V=0}, \Pi|_{V=1}).$$

Therefore, in the unbiased case, (2) implies the typical form of the data processing inequality.
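The sandwich between squared Hellinger distance and mutual information under a uniform prior can be checked numerically on small discrete distributions. The sketch below is only a sanity check under stated assumptions: it uses the convention $h^2(p,q) = 1 - \sum_i \sqrt{p_i q_i}$, measures information in bits, and samples random probability vectors; the helper names are hypothetical.

```python
import math
import random

def hellinger_sq(p, q):
    """Squared Hellinger distance 1 - sum_i sqrt(p_i q_i) between two pmfs."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def mutual_information_uniform_prior(p, q):
    """I(V;Pi) in bits when V is uniform on {0,1}, Pi|V=0 ~ p and Pi|V=1 ~ q.
    This quantity equals the Jensen-Shannon divergence of p and q."""
    total = 0.0
    for a, b in zip(p, q):
        mid = 0.5 * (a + b)
        if a > 0:
            total += 0.5 * a * math.log2(a / mid)
        if b > 0:
            total += 0.5 * b * math.log2(b / mid)
    return total

def random_pmf(rng, k):
    """A random probability vector of length k."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def check_sandwich(trials=1000, k=5, seed=0):
    """Return True if h^2 <= I(V;Pi) <= 2 h^2 holds on all sampled pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_pmf(rng, k), random_pmf(rng, k)
        h2 = hellinger_sq(p, q)
        mi = mutual_information_uniform_prior(p, q)
        if not (h2 - 1e-12 <= mi <= 2 * h2 + 1e-12):
            return False
    return True
```

The two extreme cases are instructive: for distributions with disjoint supports both quantities equal 1, while for nearly identical distributions the ratio stays bounded by a constant.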

Remark 2. The tightness of our inequality does not imply that there is a protocol that solves
the distributed detection problem with communication cost (or information cost)
$O(\beta^{-1})$. We only show that inequality (2) is tight for some protocol, but solving the
problem requires having a protocol such that (2) is tight and that
$h^2(\Pi|_{V=0}, \Pi|_{V=1}) = \Omega(1)$. In fact, a protocol for which
inequality (2) is tight is one in which only a single machine sends a message $\Pi$ which maximizes
$I(\Pi;V)/I(\Pi;X)$.

Organization of the paper: Section 2 formally sets up our model and problems and introduces
some preliminaries. Then we prove our main theorem in Section 3. In Section 4 we state
the main applications of our theory to the sparse Gaussian mean estimation problem and to the
sparse linear regression problem. The next three sections are devoted to the proofs of results
in Section 4. In Section 5 we prove Theorem 4.4 and in Section 6 we prove Theorem 4.3 and
Corollary 4.8. In Section 7 we provide tools for proving single machine strong data processing
inequality and prove Theorem 4.1. In Section B we present our matching upper bound in the
simultaneous communication model. In Section C we give a simple proof of distributed gap
majority problems using our machinery.


2 Problem Setup, Notations and Preliminaries 

2.1 Distributed Protocols and Parameter Estimation Problems 

Let $\mathcal{P} = \{\mu_\theta : \theta \in \Omega\}$ be a family of distributions over some
space $\mathcal{X}$, and $\Omega \subset \mathbb{R}^d$ be the space of all possible parameters.
There is an unknown distribution $\mu_\theta \in \mathcal{P}$, and our goal is to
estimate the parameter $\theta$ using $m$ machines. Machine $j$ receives $n$ i.i.d. samples
$X_j^{(1)}, \dots, X_j^{(n)}$ from distribution $\mu_\theta$. For simplicity we will use $X_j$ as
a shorthand for all the samples machine $j$ receives, that is,
$X_j = (X_j^{(1)}, \dots, X_j^{(n)})$. Therefore $X_j \sim \mu_\theta^n$, where $\mu^n$ denotes
the product of $n$ copies of $\mu$. When it is clear from context, we will use $X$ as a shorthand
for $(X_1, \dots, X_m)$. We define the problem of estimating the parameter $\theta$ in this
distributed setting formally as task $T(n, m, \mathcal{P})$. When $\Omega = \{0,1\}$, we call this
a detection problem and refer to it as $T_{\det}(n, m, \mathcal{P})$.

The machines communicate via a publicly shown blackboard. That is, when a machine 
writes a message on the blackboard, all other machines can see the content. The messages that 
are written on the blackboard are counted as communication between the machines. Note that 
this model captures both point-to-point communication as well as broadcast communication. 
Therefore, our lower bounds in this model apply to both the message passing setting and the 
broadcast setting.

We denote the collection of all the messages written on the blackboard by $\Pi$. We will refer
to $\Pi$ as the transcript and note that $\Pi \in \{0,1\}^*$ is written in bits and the
communication cost is defined as the length of $\Pi$, denoted by $|\Pi|$. We will call the
algorithm that the machines follow to produce $\Pi$ a protocol. With a slight abuse of notation,
we use $\Pi$ to denote both the protocol and the transcript produced by the protocol.

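One of the machines needs to estimate the value of $\theta$ using an estimator
$\hat\theta : \{0,1\}^* \to \mathbb{R}^d$ which takes $\Pi$ as input. The accuracy of the estimator
on $\theta$ is measured by the mean-squared loss:

$$R((\Pi, \hat\theta), \theta) = \mathbb{E}\left[\|\hat\theta(\Pi) - \theta\|^2\right],$$

where the expectation is taken over the randomness of the data $X$ and the estimator
$\hat\theta$. The error of the estimator is the supremum of the loss over all $\theta$,

$$R(\Pi, \hat\theta) = \sup_{\theta \in \Omega} \mathbb{E}\left[\|\hat\theta(\Pi) - \theta\|^2\right]. \tag{3}$$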


The communication cost of a protocol is measured by the expected length of the transcript $\Pi$,
that is, $\mathrm{CC}(\Pi) = \sup_{\theta \in \Omega} \mathbb{E}[|\Pi|]$. The information cost IC
of a protocol is defined as the mutual information between transcript $\Pi$ and the data $X$,

$$\mathrm{IC}(\Pi) = \sup_{\theta \in \Omega} I_\theta(\Pi; X \mid R_{\mathrm{pub}}) \tag{4}$$

where $R_{\mathrm{pub}}$ denotes the public coin used by the algorithm and
$I_\theta(\Pi; X \mid R_{\mathrm{pub}})$ denotes the mutual information between random variable
$X$ and $\Pi$ when the data $X$ is drawn from distribution
$\mu_\theta$. We will drop the subscript $\theta$ when it is clear from context.

For the detection problem, we need to define the minimum information cost, a stronger version
of information cost:

$$\text{min-IC}(\Pi) = \min_{v \in \{0,1\}} I_v(\Pi; X \mid R_{\mathrm{pub}}) \tag{5}$$

Definition 2. We say that a protocol and estimator pair $(\Pi, \hat\theta)$ solves the distributed
estimation problem $T(m, n, d, \Omega, \mathcal{P})$ with information cost $I$, communication cost
$C$, and mean-squared loss $R$ if $\mathrm{IC}(\Pi) \le I$, $\mathrm{CC}(\Pi) \le C$ and
$R(\Pi, \hat\theta) \le R$.

When $\Omega = \{0,1\}$, we have a detection problem, and we typically use $v$ to denote the
parameter and $\hat v$ as the (discrete) estimator for it. We define the communication and
information cost in the same way as above, while defining the error in a more meaningful and
convenient way,

$$R_{\det}(\Pi, \hat v) = \max_{v \in \{0,1\}} \Pr[\hat v(\Pi) \ne v \mid V = v].$$

Definition 3. We say that a protocol and estimator pair $(\Pi, \hat v)$ solves the distributed
detection problem $T_{\det}(m, n, d, \Omega, \mathcal{P})$ with information cost $I$ if
$\mathrm{IC}(\Pi) \le I$ and $R_{\det}(\Pi, \hat v) \le 1/4$.

Now we formally define the concrete questions that we are concerned with. 

Distributed Gaussian detection problem: We call the problem with $\Omega = \{0,1\}$ and
$\mathcal{P} = \{\mathcal{N}(0, \sigma^2)^n, \mathcal{N}(\delta, \sigma^2)^n\}$ the Gaussian mean
detection problem, denoted by $\mathrm{GD}(n, m, \delta, \sigma^2)$.

Distributed (sparse) Gaussian mean estimation problem: The distributed statistical
estimation problem defined by $\Omega = \mathbb{R}^d$ and
$\mathcal{P} = \{\mathcal{N}(\theta, \sigma^2 I_{d \times d}) : \theta \in \Omega\}$ is called the
distributed Gaussian mean estimation problem, abbreviated $\mathrm{GME}(n, m, d, \sigma^2)$. When
$\Omega = \{\theta \in \mathbb{R}^d : |\theta|_0 \le k\}$, the corresponding problem is referred
to as distributed sparse Gaussian mean estimation,
abbreviated $\mathrm{SGME}(n, m, d, k, \sigma^2)$.

Distributed sparse linear regression: For simplicity and the purpose of lower bounds, we
only consider sparse linear regression with a random design matrix. To fit into our framework,
we can also regard the design matrix as part of the data. We have a parameter space
$\Omega = \{\theta \in \mathbb{R}^d : |\theta|_0 \le k\}$. The $j$-th data point consists of a row
of the design matrix $A_j$ and the observation $y_j = \langle A_j, \theta \rangle + w_j$ where
$w_j \sim \mathcal{N}(0, \sigma^2)$ for $j = 1, \dots, mn$, and each machine receives $n$ data
points among them.⁵ Formally, let $\mu_\theta$ denote the joint distribution of $(A_j, y_j)$ here,
and let $\mathcal{P} = \{\mu_\theta : \theta \in \Omega\}$. We use
$\mathrm{SLR}(n, m, d, k, \sigma^2)$ as shorthand for this problem.

2.2 Hellinger distance and cut-paste property 

In this subsection, we introduce Hellinger distance, and the key property of protocols that
we exploit here, the so-called “cut-paste” property developed by [BYJKS04] for proving lower
bounds for set-disjointness and other problems. We also introduce some notation that will be
used later in the proofs.

Definition 4 (Hellinger distance). Consider two distributions with probability density functions
$f, g : \Omega \to \mathbb{R}$. The square of the Hellinger distance between $f$ and $g$ is
defined as

$$h^2(f, g) := \frac{1}{2} \int_\Omega \left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx.$$
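For the Gaussian pair used later in the paper, the definition can be evaluated directly. The sketch below compares a simple midpoint-rule integration of the definition with the standard closed form $1 - e^{-\delta^2/(8\sigma^2)}$ for two Gaussians of equal variance; the integration range and step count are illustrative choices, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def hellinger_sq_numeric(delta, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of (1/2) * integral (sqrt f - sqrt g)^2 dx
    for f = N(0, sigma^2) and g = N(delta, sigma^2)."""
    lo = min(0.0, delta) - half_width * sigma
    hi = max(0.0, delta) + half_width * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        diff = math.sqrt(gaussian_pdf(x, 0.0, sigma)) - math.sqrt(gaussian_pdf(x, delta, sigma))
        total += diff * diff * dx
    return 0.5 * total

def hellinger_sq_closed_form(delta, sigma):
    """Standard closed form for two Gaussians with equal variance."""
    return 1.0 - math.exp(-delta ** 2 / (8 * sigma ** 2))
```

For small $\delta/\sigma$ the closed form behaves like $\delta^2/(8\sigma^2)$, so a single sample separates the two hypotheses only slightly in Hellinger distance.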

A key observation regarding the property of a protocol, due to [BYJKS04, Lemma 16], is the
following: fixing $X_1 = x_1, \dots, X_m = x_m$, the distribution of $\Pi|_{X=x}$ can be factored
in the following form,

$$\Pr[\Pi = \pi \mid X = x] = p_{1,\pi}(x_1) \cdots p_{m,\pi}(x_m) \tag{6}$$

where $p_{i,\pi}(\cdot)$ is a function that only depends on $i$ and the entire transcript $\pi$.
To see this, one could simply write the density of $\Pi$ as a product of the densities of the
individual messages of the machines and group the terms properly according to machines (and note
that $p_{i,\pi}(\cdot)$ is allowed to depend on the entire transcript $\pi$).

We extend equation (6) to the situation where the inputs are from product distributions.

For any vector $b \in \{0,1\}^m$, let $\mu_b := \mu_{b_1} \times \cdots \times \mu_{b_m}$ be a
distribution over $\mathcal{X}^m$. We denote by $\Pi_b$ the distribution of
$\Pi(X_1, \dots, X_m)$ when $(X_1, \dots, X_m) \sim \mu_b$.

Therefore if $X \sim \mu_b$, using the fact that $\mu_b$ is a product measure, we can marginalize
over $X$ and obtain the marginal distribution of $\Pi$ when $X \sim \mu_b$,

$$\Pr_{X \sim \mu_b}[\Pi = \pi] = q_{1,\pi}(b_1) \cdots q_{m,\pi}(b_m), \tag{7}$$

where $q_{j,\pi}(b_j)$ is the marginalization of $p_{j,\pi}(x)$ over $x \sim \mu_{b_j}$, that is,
$q_{j,\pi}(b_j) = \int_x p_{j,\pi}(x)\, d\mu_{b_j}$.

Recall that $\Pi_b$ denotes the distribution of $\Pi$ when $X \sim \mu_b$. Then by the
decomposition (7) of $\Pi_b(\pi)$ above, we have the following cut-paste property for $\Pi_b$,
which will be the key property of a protocol that we exploit.

Proposition 2.1 (Cut-paste property of a protocol). For any $a, b$ and $c, d$ with
$\{a_i, b_i\} = \{c_i, d_i\}$ (in a multi-set sense) for every $i \in [m]$,

$$\Pi_a(\pi) \cdot \Pi_b(\pi) = \Pi_c(\pi) \cdot \Pi_d(\pi) \tag{8}$$

and therefore,

$$h^2(\Pi_a, \Pi_b) = h^2(\Pi_c, \Pi_d). \tag{9}$$
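The cut-paste identity (8) can be verified exactly on a toy protocol. In the sketch below, which is an illustration and not taken from the paper, there are $m = 2$ machines with one-bit inputs; machine 1 sends a noisy copy of its bit and machine 2 replies with a noisy function of its own bit and the first message. The distributions `MU` and the message rules are arbitrary assumptions, and exact rational arithmetic is used so that equality can be tested without rounding error.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical input distributions on one bit: MU[v][x] = Pr[X = x] under mu_v.
MU = {0: {0: F(4, 5), 1: F(1, 5)},
      1: {0: F(3, 10), 1: F(7, 10)}}

def msg1_prob(t1, x1):
    """Machine 1 sends its bit, flipped with probability 1/4."""
    return F(3, 4) if t1 == x1 else F(1, 4)

def msg2_prob(t2, x2, t1):
    """Machine 2 sends x2 XOR t1, flipped with probability 1/3."""
    return F(2, 3) if t2 == (x2 ^ t1) else F(1, 3)

def transcript_dist(b):
    """Distribution Pi_b over transcripts (t1, t2) when X_j ~ mu_{b_j}."""
    dist = {}
    for x1, x2, t1, t2 in product((0, 1), repeat=4):
        p = MU[b[0]][x1] * MU[b[1]][x2] * msg1_prob(t1, x1) * msg2_prob(t2, x2, t1)
        dist[(t1, t2)] = dist.get((t1, t2), F(0)) + p
    return dist

def cut_paste_holds():
    """Check Pi_00(t) * Pi_11(t) == Pi_01(t) * Pi_10(t) for every transcript t."""
    pi = {b: transcript_dist(b) for b in product((0, 1), repeat=2)}
    return all(pi[(0, 0)][t] * pi[(1, 1)][t] == pi[(0, 1)][t] * pi[(1, 0)][t]
               for t in product((0, 1), repeat=2))
```

The identity holds because each transcript probability factors into one term per machine, so swapping the $i$-th coordinates of the two input vectors only permutes the factors in the product.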

⁵ We note that here, for convenience, we use subscripts for samples, which is different from the
notation convention used for previous problems.

3 Distributed Strong Data Processing Inequalities

In this section we prove our main Theorem 1.1. We state a slightly weaker looking version here,
but in fact it implies Theorem 1.1 by symmetry. The same proof also goes through for the case
when the RHS is conditioned on $V = 1$.

Theorem 3.1. Suppose $\mu_1 \le c \cdot \mu_0$ and $\beta(\mu_0, \mu_1) = \beta$. Then we have

$$h^2(\Pi|_{V=0}, \Pi|_{V=1}) \le K(c+1)\beta \cdot I(X;\Pi \mid V = 0), \tag{10}$$

where $K$ is an absolute constant.

Note that the RHS of (10) naturally tensorizes (by a lemma that appears below) in the sense
that

$$\sum_{i=1}^{m} I(X_i;\Pi \mid V = 0) \le I(X;\Pi \mid V = 0), \tag{11}$$

since conditioned on $V = 0$, the $X_i$'s are independent. Our main idea consists of the following
two steps: a) we tensorize the LHS of (10) so that the target inequality (10) can be written as a
sum of $m$ inequalities; b) we prove each of these $m$ inequalities using the single machine SDPI.
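Inequality (11) can be strict, which a two-machine toy example makes visible. The sketch below is an illustration under assumptions not in the paper: the two inputs are independent uniform bits and the protocol is a deterministic function of both. It computes the per-machine sum and the joint mutual information, in bits.

```python
import math
from itertools import product

def mutual_information(pairs):
    """I(A;B) in bits when the (a, b) outcomes listed in `pairs` are equally likely."""
    n = len(pairs)
    joint, pa, pb = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0) + 1 / n
        pa[a] = pa.get(a, 0) + 1 / n
        pb[b] = pb.get(b, 0) + 1 / n
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items())

def tensorization_gap(protocol):
    """Return (sum_i I(X_i;Pi), I(X;Pi)) for two independent uniform input bits
    and a deterministic protocol mapping (x1, x2) to a transcript."""
    inputs = list(product((0, 1), repeat=2))
    per_machine = sum(
        mutual_information([(x[i], protocol(*x)) for x in inputs]) for i in range(2))
    joint = mutual_information([(x, protocol(*x)) for x in inputs])
    return per_machine, joint
```

With `protocol = lambda a, b: a ^ b` the transcript reveals nothing about either input alone but one full bit about the pair, so the left side of (11) is 0 while the right side is 1; with `protocol = lambda a, b: (a, b)` both sides equal 2.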

To this end, we do the following thought experiment: Suppose $W$ is a random variable that
takes value from $\{0,1\}$ uniformly. Suppose data $X'$ is generated as follows:
$X'_i \sim \mu_W$, and for any $j \ne i$, $X'_j \sim \mu_0$. We apply the protocol on the input
$X'$, and view the resulting transcript $\Pi'$ as communication between the $i$-th machine and the
remaining machines. Then we are in the situation of a single machine case, that is,
$W \to X'_i \to \Pi'$ forms a Markov chain. Applying the strong data processing inequality (1),
we obtain that

$$I(W;\Pi') \le \beta \cdot I(X'_i;\Pi'). \tag{12}$$

Using Lemma 10, we can lower bound the LHS of (12) by the Hellinger distance and obtain

$$h^2(\Pi'|_{W=0}, \Pi'|_{W=1}) \le \beta \cdot I(X'_i;\Pi').$$

Let $e_i = (0, 0, \dots, 1, \dots, 0)$ be the unit vector that only takes 1 in the $i$-th entry,
and $\mathbf{0}$ the all zero vector. Using the notation defined in Section 2.2, we observe that
$\Pi'|_{W=0}$ has distribution $\Pi_{\mathbf{0}}$ while $\Pi'|_{W=1}$ has distribution
$\Pi_{e_i}$. Then we can rewrite the equation above as

$$h^2(\Pi_{\mathbf{0}}, \Pi_{e_i}) \le \beta \cdot I(X'_i;\Pi'). \tag{13}$$

Observe that the RHS of (13) is close to the $i$-th term on the LHS of (11), since the joint
distribution of $(X'_i, \Pi')$ is not very far from that of $(X_i, \Pi) \mid V = 0$. (The only
difference is that $X'_i$ is drawn from a mixture of $\mu_0$ and $\mu_1$, and note that $\mu_0$ is
not too far from $\mu_1$.) On the other hand, the sum of the LHS of (13) over $i \in [m]$ is
lower-bounded by the LHS of (10). Therefore, we can tensorize inequality (10) into inequalities of
the form (13), which can be proved by the single machine SDPI. We formalize the intuition above by
the following two lemmas.

Lemma 1. Suppose $\mu_1 \le c \cdot \mu_0$ and $\beta(\mu_0, \mu_1) = \beta$. Then

$$h^2(\Pi_{e_i}, \Pi_{\mathbf{0}}) \le \frac{(c+1)\beta}{2} \cdot I(X_i;\Pi \mid V = 0). \tag{14}$$

Lemma 2. Let $\mathbf{0}$ be the $m$-dimensional all 0’s vector, and $\mathbf{1}$ the all 1’s
vector. We have that

$$h^2(\Pi_{\mathbf{0}}, \Pi_{\mathbf{1}}) \le O(1) \cdot \sum_{i=1}^{m} h^2(\Pi_{e_i}, \Pi_{\mathbf{0}}). \tag{15}$$

Using Lemma 1 and Lemma 2 we obtain Theorem 3.1 straightforwardly by combining
inequalities (11), (14) and (15).⁶

Finally we provide the proof of Lemma 1. Lemma 2 is a direct corollary of Theorem E.1
(which is in turn a direct corollary of Theorem 7 of [Jay09]) and Proposition 2.1.

Proof of Lemma 1. Let $W$ be a uniform Bernoulli random variable and define $X'$ and $\Pi'$ as
follows: Conditioned on $W = 0$, $X' \sim \mu_{\mathbf{0}}$, and conditioned on $W = 1$,
$X' \sim \mu_{e_i}$. We run the protocol on $X'$ and get transcript $\Pi'$.

Note that $W \to X' \to \Pi'$ is a Markov chain and so is $W \to X'_i \to \Pi'$. Also by
definition, the conditional random variable $X'_i \mid W$ has the same distribution as the random
variable $X \mid V$ in Definition 1. Therefore by Definition 1, we have that

$$\beta \cdot I(X'_i;\Pi') \ge I(W;\Pi'). \tag{16}$$

It is known that mutual information can be expressed as the expectation of KL divergence, which
in turn is lower-bounded by Hellinger distance. We invoke a technical variant of this argument,
Lemma 6.2 of [BYJKS04], restated as Lemma 10, to lower bound the right hand side. Note that
$Z$ in Lemma 10 corresponds to $W$ here, and $\phi_{z_1}, \phi_{z_2}$ correspond to $\Pi_{e_i}$
and $\Pi_{\mathbf{0}}$. Therefore,

$$I(W;\Pi') \ge h^2(\Pi_{e_i}, \Pi_{\mathbf{0}}). \tag{17}$$

It remains to relate $I(X'_i;\Pi')$ to $I(X_i;\Pi \mid V = 0)$. Note that the difference between
the joint distributions of $(X'_i, \Pi')$ and $(X_i, \Pi)|_{V=0}$ is that
$X'_i \sim \frac{1}{2}(\mu_0 + \mu_1)$ and $X_i|_{V=0} \sim \mu_0$. We claim
that since $\mu_0 \ge \frac{2}{c+1} \cdot \frac{\mu_0 + \mu_1}{2}$, we have

$$I(X_i;\Pi \mid V = 0) \ge \frac{2}{c+1} \cdot I(X'_i;\Pi'). \tag{18}$$

Combining equations (16), (17) and (18), we obtain the desired inequality.

□


4 Applications to Parameter Estimation Problems 

4.1 Warm-up: Distributed Gaussian mean detection 

In this section we apply our main technical Theorem 3.1 to the situation when
$\mu_0 = \mathcal{N}(0, \sigma^2)$ and $\mu_1 = \mathcal{N}(\delta, \sigma^2)$. We are also
interested in the case when each machine receives $n$ samples from either $\mu_0$ or $\mu_1$. We
will denote the product of $n$ i.i.d. copies of $\mu_v$ by $\mu_v^n$, for $v \in \{0,1\}$.

Theorem 3.1 requires that a) $\beta = \beta(\mu_0, \mu_1)$ can be calculated or estimated, and
b) the densities of distributions $\mu_0$ and $\mu_1$ are within a constant factor of each other
at every point.

Certainly b) is not true for any two Gaussian distributions. To this end, we consider
$\mu'_0, \mu'_1$, the truncations of $\mu_0$ and $\mu_1$ on some support $[-\tau, \tau]$, and argue
that the probability mass outside $[-\tau, \tau]$ is too small to make a difference.
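The reason truncation helps is visible from the likelihood ratio of the two Gaussians, $\mu_1(x)/\mu_0(x) = \exp\left((2x\delta - \delta^2)/(2\sigma^2)\right)$, which is unbounded on the real line but bounded on any bounded interval. The sketch below evaluates this ratio for the untruncated densities on a grid; the function names and grid size are illustrative assumptions, and the normalization constants introduced by truncation (which change the ratio by a further constant factor) are ignored.

```python
import math

def density_ratio(x, delta, sigma):
    """mu_1(x) / mu_0(x) for mu_0 = N(0, sigma^2) and mu_1 = N(delta, sigma^2)."""
    return math.exp((2 * x * delta - delta ** 2) / (2 * sigma ** 2))

def max_ratio_on_interval(tau, delta, sigma, steps=1000):
    """Largest value of mu_1/mu_0 and of mu_0/mu_1 over a grid on [-tau, tau]."""
    xs = [-tau + 2 * tau * i / steps for i in range(steps + 1)]
    ratios = [density_ratio(x, delta, sigma) for x in xs]
    return max(ratios), max(1 / r for r in ratios)
```

When $\tau\delta/\sigma^2$ is bounded by a constant, both returned values are bounded by a constant as well, which is the kind of condition that Theorem 3.1 asks for.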

For a), we use tools provided by Raginsky [Rag14] to estimate the SDPI constant $\beta$. [Rag14]
proves that Gaussian distributions $\mu_0$ and $\mu_1$ have SDPI constant
$\beta(\mu_0, \mu_1) \le O(\delta^2/\sigma^2)$, and more generally it connects the SDPI constants
to transportation inequalities. We use the framework established by [Rag14] and apply it to the
truncated Gaussian distributions $\mu'_0$ and $\mu'_1$. Our proof essentially uses the fact that
$(\mu'_0 + \mu'_1)/2$ is a log-concave distribution and therefore
it satisfies the log-Sobolev inequality, and equivalently it also satisfies the transportation
inequality. The details and connections to concentration of measures are provided in Section 15751 

⁶ Note that $\Pi_{\mathbf{0}}$ is the same distribution as $\Pi|_{V=0}$ under the notation introduced in Section 2.2.















Theorem 4.1. Let $\mu'_0$ and $\mu'_1$ be the distributions obtained by truncating $\mu_0$ and
$\mu_1$ on support $[-\tau, \tau]$ for some $\tau > 0$. If $\delta \le \sigma$, we have
$\beta(\mu'_0, \mu'_1) \le \delta^2/\sigma^2$.

As a corollary, the SDPI constant between $n$ copies of $\mu'_0$ and $\mu'_1$ is bounded by
$n\delta^2/\sigma^2$.

Corollary 4.2. Let $\tilde\mu_0$ and $\tilde\mu_1$ be the distributions over $\mathbb{R}^n$ that
are obtained by truncating $\mu_0^n$ and $\mu_1^n$ outside the ball
$B = \{x \in \mathbb{R}^n : |x_1 + \cdots + x_n| \le \tau\}$. Then when
$\sqrt{n}\,\delta \le \sigma$, we have

$$\beta(\tilde\mu_0, \tilde\mu_1) \le n\delta^2/\sigma^2.$$

Applying our distributed data processing inequality (Theorem 3.1) on $\tilde\mu_0$ and
$\tilde\mu_1$, we obtain directly that to distinguish $\tilde\mu_0$ and $\tilde\mu_1$ in the
distributed setting, $\Omega\left(\frac{\sigma^2}{n\delta^2}\right)$ communication is
required. By properly handling the truncation of the support, we can prove that it is also true
with the true Gaussian distribution.

Theorem 4.3. Any protocol estimator pair $(\Pi, \hat v)$ that solves the distributed Gaussian mean
detection problem $\mathrm{GD}(n, m, \delta, \sigma^2)$ with $\delta \le \sigma/\sqrt{n}$ requires
communication cost and minimum information cost at least

$$\mathbb{E}[|\Pi|] \ge \text{min-IC}(\Pi) \ge \Omega\left(\frac{\sigma^2}{n\delta^2}\right).$$

Remark 3. The condition $\delta \le \sigma/\sqrt{n}$ captures the interesting regime. When
$\delta \gg \sigma/\sqrt{n}$, a single machine can even distinguish $\mu_0$ and $\mu_1$ by its
local $n$ samples.
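The regime distinction in this remark can be quantified with a one-line calculation. The sketch below assumes a specific local test that is not spelled out in the paper: a single machine thresholds its sample mean at $\delta/2$. Since the sample mean has standard deviation $\sigma/\sqrt{n}$, the error probability under either hypothesis is $\Phi(-\sqrt{n}\,\delta/(2\sigma))$, where $\Phi$ is the standard normal CDF.

```python
import math

def single_machine_error(n, delta, sigma):
    """Error probability of thresholding the local sample mean at delta/2 when
    deciding between N(0, sigma^2) and N(delta, sigma^2) from n samples."""
    z = math.sqrt(n) * delta / (2 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2))   # equals Phi(-z)
```

When $\delta$ is much larger than $\sigma/\sqrt{n}$ the error is negligible and no communication is needed for detection, whereas for $\delta$ well below $\sigma/\sqrt{n}$ the local test is close to a coin flip, which is the regime where the lower bound of Theorem 4.3 is of interest.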

Proof of Theorem 4.3. Let $\Pi_0$ and $\Pi_1$ be the distributions of $\Pi|_{V=0}$ and
$\Pi|_{V=1}$ as defined in Section 2.2. Since $\hat v$ solves the detection problem, we have that
$\|\Pi_0 - \Pi_1\|_{TV} \ge 1/4$. It follows from the relation between total variation distance
and Hellinger distance that $h(\Pi_0, \Pi_1) \ge \Omega(1)$.

We pick a threshold $\tau = 20\sigma$, and let
$B = \{z \in \mathbb{R}^n : |z_1 + \cdots + z_n| \le \sqrt{n}\,\tau\}$. Let $F = 1$
denote the event that $X = (X_1, \dots, X_n) \in B$, and otherwise $F = 0$. Note that
$\Pr[F = 1] \ge 0.95$ and therefore even if we condition on the event that $F = 1$, the protocol
estimator pair should still be able to recover $v$ with good probability in the sense that

$$\Pr[\hat v(\Pi(X)) = v \mid V = v, F = 1] \ge 0.6. \tag{19}$$

We run our whole argument conditioning on the event $F = 1$. First note that for any
Markov chain $V \to X \to \Pi$, and any random variable $F$ that only depends on $X$, the chain
$V|_{F=1} \to X|_{F=1} \to \Pi|_{F=1}$ is also a Markov chain. Second, the channel from $V$ to
$X|_{F=1}$ satisfies that the random variable $X|_{V=v,F=1}$ has the distribution $\tilde\mu_v$ as
defined in the statement of Corollary 4.2. Note that by Corollary 4.2 we have that
$\beta(\tilde\mu_0, \tilde\mu_1) \le n\delta^2/\sigma^2$. Also note that by
the choice of $\tau$ and the fact that $\delta \le O(\sigma/\sqrt{n})$, we have that for any
$z \in B$, $\tilde\mu_1(z) \le O(1) \cdot \tilde\mu_0(z)$.
Therefore we are ready to apply Theorem 3.1 and conclude that

$$I(X;\Pi \mid V = 0, F = 1) \ge \Omega\left(\beta(\tilde\mu_0, \tilde\mu_1)^{-1}\right) = \Omega\left(\frac{\sigma^2}{n\delta^2}\right).$$

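Note that $\Pi$ is independent of $F$ conditioned on $X$ and $V = 0$. Therefore we have that

$$I(X; \Pi \mid V = 0) \ge I(X; \Pi \mid F, V = 0) \ge I(X; \Pi \mid F = 1, V = 0) \Pr[F = 1 \mid V = 0] = \Omega\left(\frac{\sigma^2}{n\delta^2}\right).$$

Note that by construction, it is also true that $\tilde{\mu}_1 \le O(1) \cdot \tilde{\mu}_0$, and therefore if we switch the positions of $\tilde{\mu}_0$ and $\tilde{\mu}_1$ and run the argument above we obtain

$$I(X; \Pi \mid V = 1) = \Omega\left(\frac{\sigma^2}{n\delta^2}\right).$$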

Hence the proof is complete. 

□

4.2 Sparse Gaussian mean estimation

In this subsection, we prove our lower bound for the sparse Gaussian mean estimation problem via a variant of the direct-sum theorem of [GMN14] tailored towards sparse mean estimation.

Our general idea is to make the following reduction argument: given a protocol $\Pi$ for the $d$-dimensional $k$-sparse estimation problem with information cost $I$ and loss $R$, we can construct a protocol $\Pi'$ for the detection problem with information cost roughly $I/d$ and loss $R/k$. The protocol $\Pi'$ embeds the detection problem into one random coordinate of the $d$-dimensional problem, prepares fake data on the remaining coordinates, and then runs the protocol $\Pi$ on the high-dimensional problem. It then extracts information about the true data from the corresponding coordinate of the high-dimensional estimator.

The key distinction from the construction of [GMN14] is that here we are not able to show that $\Pi'$ has small information cost, but only that $\Pi'$ has a small minimum information cost.¹ This is the reason why in Theorem 4.3 we needed to bound the minimum information cost instead of the information cost.

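To formalize the intuition, let $\mathcal{P} = \{\mu_0, \mu_1\}$ define the detection problem. Let $\Omega_{d,k,\delta} = \{\theta : \theta \in \{0, \delta\}^d, |\theta|_0 \le k\}$ and $\mathcal{Q}_{d,k,\delta} = \{\mu_\theta = \mu_{\theta_1/\delta} \times \cdots \times \mu_{\theta_d/\delta} : \theta \in \Omega_{d,k,\delta}\}$. Therefore $\mathcal{Q}$ is a special case of the general $k$-sparse high-dimensional problem. We have the following.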

Theorem 4.4 (Direct-sum for sparse parameters). Let $d \ge 2k$, and let $\mathcal{P}$ and $\mathcal{Q}$ be defined as above. If there exists a protocol estimator pair $(\Pi, \hat{\theta})$ that solves the estimation task $T(n, m, \mathcal{Q})$ with information cost $I$ and mean-squared loss $R \le \frac{1}{16} k\delta^2$, then there exists a protocol estimator pair $(\Pi', \hat{v}')$ (shown in Protocol 1 in Section 5) that solves the task $T_{\mathrm{det}}(n, m, \mathcal{P})$ with minimum information cost $\frac{I}{d-k+1}$.

The proof of the theorem is deferred to Section 5. Combining Theorem 4.3 and Theorem 4.4, we get the following theorem:

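Theorem 4.5. Suppose $d \ge 2k$. Any protocol estimator pair $(\Pi, \hat{v})$ that solves the $k$-sparse Gaussian mean problem $\mathrm{SGME}(n, m, d, k, \sigma^2)$ with mean-squared loss $R$, information cost $I$ and communication cost $C$ satisfies

$$R \ge \Omega\left(\min\left\{\frac{\sigma^2 k}{n}, \max\left\{\frac{\sigma^2 dk}{nI}, \frac{\sigma^2 k}{nm}\right\}\right\}\right) \ge \Omega\left(\min\left\{\frac{\sigma^2 k}{n}, \max\left\{\frac{\sigma^2 dk}{nC}, \frac{\sigma^2 k}{nm}\right\}\right\}\right). \qquad (20)$$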


Intuitively, to parse equation (20), we remark that the term $\frac{\sigma^2 k}{n}$ comes from the fact that any local machine can achieve this error $O\left(\frac{\sigma^2 k}{n}\right)$ using only its local samples, and the term $\frac{\sigma^2 k}{nm}$ is the minimax error that the machines can achieve with an infinite amount of communication. When the target error is between these two quantities, equation (20) predicts that the minimum communication $C$ should scale inversely with the error $R$.
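The three regimes described above can be tabulated with a small helper. This is only an illustrative sketch: it assumes the bound has the form $\min\{\sigma^2 k/n, \max\{\sigma^2 dk/(nC), \sigma^2 k/(nm)\}\}$ and sets every hidden constant to 1; the function name and argument names are ours, not the paper's.

```python
def sparse_mean_error_lower_bound(n, m, d, k, sigma2, comm_bits):
    """Evaluate min{s2*k/n, max{s2*d*k/(n*C), s2*k/(n*m)}} with all constants set to 1."""
    local_only = sigma2 * k / n                      # error one machine reaches on its own
    comm_limited = sigma2 * d * k / (n * comm_bits)  # term that decays like 1/C
    minimax = sigma2 * k / (n * m)                   # floor even with unlimited communication
    return min(local_only, max(comm_limited, minimax))
```

Between the two plateaus, doubling `comm_bits` halves the returned value, which is the inverse-linear scaling mentioned in the text.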

Our theorem gives a tight tradeoff between $C$ and $R$ up to a logarithmic factor, since it is known [GMN14] that for any communication budget $C$, there exists a protocol which uses $C$ bits and has error $R \le O\left(\min\left\{\frac{\sigma^2 k}{n}, \max\left\{\frac{\sigma^2 dk}{nC}, \frac{\sigma^2 k}{nm}\right\}\right\} \cdot \log d\right)$.

As a side product, in the case when $k = d/2$, our lower bound improves previous works [DJWZ14] and [GMN14] by a logarithmic factor, and turns out to match the upper bound in [GMN14] up to a constant factor.

Proof of Theorem 4.5. If $R \ge \frac{1}{16} \cdot \frac{k\sigma^2}{n}$ then we are done. Otherwise, let $\delta := \sqrt{16R/k} \le \sigma/\sqrt{n}$. Let $\mu_0 = \mathcal{N}(0, \sigma^2)$, $\mu_1 = \mathcal{N}(\delta, \sigma^2)$ and $\mathcal{P} = \{\mu_0, \mu_1\}$. Let $\mathcal{Q}_{d,k,\delta} = \{\mu_\theta = \mu_{\theta_1/\delta} \times \cdots \times \mu_{\theta_d/\delta} : \theta \in \Omega_{d,k,\delta}\}$. Then $T(n, m, \mathcal{Q})$ is just a special case of the sparse Gaussian mean estimation problem $\mathrm{SGME}(n, m, d, k, \sigma^2)$, and $T(n, m, \mathcal{P})$ is the distributed Gaussian mean detection problem $\mathrm{GD}(n, m, \delta, \sigma^2)$. Therefore, by Theorem 4.4, there exists $(\Pi', \hat{v}')$ that solves $\mathrm{GD}(n, m, \delta, \sigma^2)$ with minimum information cost $I' = O(I/d)$. Since $\delta \le O(\sigma/\sqrt{n})$, by Theorem 4.3 we have $I' \ge \Omega(\sigma^2/(n\delta^2))$. It follows that $I \ge \Omega(d\sigma^2/(n\delta^2)) = \Omega(kd\sigma^2/(nR))$. To derive (20), we observe that $\Omega(\sigma^2 k/(nm))$ is the minimax lower bound for $R$, which completes the proof. □

¹ This might be inevitable because protocol $\Pi$ might reveal a lot of information about the nonzero coordinates of $\theta$, but since there are very few non-zeros, the total information revealed is still not too much.

To complement our lower bounds, we also give a new protocol for the Gaussian mean estimation problem achieving communication optimal up to a constant factor in any number of dimensions in the dense case. Our protocol is a simultaneous protocol, whereas the only previous protocol achieving optimal communication requires $\Omega(\log m)$ rounds [GMN14]. This resolves an open question in Remark 2 of [GMN14], improving the trivial protocol in which each player sends its truncated Gaussian to the coordinator by an $O(\log m)$ factor.

Theorem 4.6. For any $0 \le \alpha \le 1$, there exists a protocol that uses one round of communication for the Gaussian mean estimation problem $\mathrm{GME}(n, m, d, \sigma^2)$ with communication cost $C = \alpha dm$ and mean-squared loss $R = O\left(\frac{\sigma^2 d}{\alpha mn}\right)$.

The protocol and proof of this theorem are deferred to Section B, though we mention a few aspects here. We first give a protocol under the assumption that $|\theta|_\infty \le \frac{\sigma}{\sqrt{n}}$. The protocol trivially generalizes to $d$ dimensions, so we focus on 1 dimension. The protocol coincides with the first round of the multi-round protocol in [GMN14], yet we can extract all necessary information in only one round, by having each machine send a single bit indicating if its input Gaussian is positive or negative. Since the mean is on the same order as the standard deviation, one can bound the variance and give an estimator based on the Gaussian density function. In Section B.1 the mean of the Gaussian is allowed to be much larger than the variance, and this no longer works. Instead, a few machines send their truncated inputs so the coordinator learns a crude approximation. To refine this approximation, in parallel the remaining machines each send a bit which is 1 with probability $x - \lfloor x \rfloor$, where $x$ is the machine's input Gaussian. This can be viewed as rounding a sample of the "sawtooth wave function" $h$ applied to a Gaussian. For technical reasons each machine needs to send two bits, the second of which is 1 with probability $(x + 1/5) - \lfloor x + 1/5 \rfloor$. We give an estimator based on an analysis using the Fourier series of $h$.


Sparse Gaussian estimation with signal strength lower bound. Our techniques can also be used to study the optimal rate-communication tradeoffs in the presence of a strong signal in the non-zero coordinates, which is sometimes assumed for sparse signals. That is, suppose the machines are promised that the mean $\theta \in \mathbb{R}^d$ is $k$-sparse and also that if $\theta_i \ne 0$, then $|\theta_i| \ge \eta$, where $\eta$ is a parameter called the signal strength. We get tight lower bounds for this case as well.


Theorem 4.7. For $d \ge 2k$ and $\eta^2 \ge 16R/k$, any protocol estimator pair $(\Pi, \hat{v})$ that solves the $k$-sparse Gaussian mean problem $\mathrm{SGME}(n, m, d, k, \sigma^2)$ with signal strength $\eta$ and mean-squared loss $R$ requires information cost (and hence expected communication cost) at least $\Omega\left(\frac{\sigma^2 d}{n\eta^2}\right)$.

Note that there is a protocol for $\mathrm{SGME}(n, m, d, k, \sigma^2)$ with signal strength $\eta$ and mean-squared loss $R$ that has communication cost $O\left(\min\left\{\frac{\sigma^2 d}{n\eta^2} + \frac{\sigma^2 k^2}{nR}, \frac{\sigma^2 dk}{nR}\right\}\right)$. In the regime where $\eta^2 \ge 16R/k$, the first term dominates, and by Theorem 4.7 and the fact that $\frac{\sigma^2 k^2}{nR}$ is a lower bound even when the machines know the support [GMN14], we also get a matching lower bound. In the regime where $\eta^2 \le 16R/k$, the second term dominates, and it is a lower bound by Theorem 4.5.


Proof of Theorem 4.7. The proof is very similar to the proof of Theorem 4.4. Given a protocol estimator pair $(\Pi, \hat{v})$ that solves $\mathrm{SGME}(n, m, d, k, \sigma^2)$ with signal strength $\eta$, mean-squared loss $R$ and information cost $I$ (where $\eta^2 \ge 16R/k$), we can find a protocol $\Pi'$ that solves the Gaussian mean detection problem $\mathrm{GD}(n, m, \eta, \sigma^2)$ with information cost at most $O(I/d)$ (as usual, the information cost is measured when the mean is 0). $\Pi'$ is exactly the same as Protocol 1, but with $\mu_0$ replaced by $\mathcal{N}(0, \sigma^2)$, $\mu_1$ replaced by $\mathcal{N}(\eta, \sigma^2)$ and $\delta$ replaced by $\eta$. We leave the details to the reader. □

4.3 Lower bound for Sparse Linear Regression 

In this section we consider the sparse linear regression problem $\mathrm{SLR}(n, m, d, k, \sigma^2)$ in the distributed setting as defined in Section 2. Suppose the $i$-th machine receives a subset $S_i$ of the $mn$ data points, and we use $A_{S_i} \in \mathbb{R}^{n \times d}$ to denote the design matrix that the $i$-th machine receives and $y_{S_i}$ to denote the observed vector. That is, $y_{S_i} = A_{S_i}\theta + w_{S_i}$, where $w_{S_i} \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$ is Gaussian noise.

This problem can be reduced from the sparse Gaussian mean problem, and thus its communication can be lower-bounded. The bound follows straightforwardly from our Theorem 4.5 and the reduction in Corollary 2 of [DJWZ14]. To state our result, we assume that the design matrices $A_{S_i}$ have spectral norm uniformly bounded by $\lambda\sqrt{n}$. That is, $\lambda = \max_{1 \le i \le m} \|A_{S_i}\|/\sqrt{n}$.

Corollary 4.8. Suppose machines receive data from the sparse linear regression model. Let $\lambda$ be as defined above. If there exists a protocol under which the machines can output an estimator $\hat{\theta}$ with mean-squared loss $R = \mathbb{E}[\|\hat{\theta} - \theta\|^2]$ and communication $C$, then $R \cdot C \ge \Omega\left(\frac{\sigma^2 kd}{\lambda^2 n}\right)$.

When $A_{S_i}$ is a Gaussian design matrix, that is, the rows of $A_{S_i}$ are drawn i.i.d. from the distribution $\mathcal{N}(0, I_{d \times d})$, we have $\lambda = O\left(\max\{\sqrt{d/n}, 1\}\right)$, and Corollary 4.8 implies that to achieve the statistical minimax rate $R = O\left(\frac{\sigma^2 k}{nm}\right)$, the algorithm has to communicate $\Omega(m \cdot \min\{n, d\})$ bits. The point is that we get a lower bound that does not depend on $k$; that is, with sparsity assumptions, it is impossible to improve both the loss and the communication so that they depend on the intrinsic dimension $k$ instead of the ambient dimension $d$. Moreover, in the regime when $d/n \to c$ for a constant $c$, our lower bound matches the upper bound of [LSLT15] up to a logarithmic factor. The proof follows straightforwardly from Theorem 4.5 and the reduction from Gaussian mean estimation to sparse linear regression of [ZDJW13], and is deferred to Section A.
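The arithmetic behind the $m \cdot \min\{n, d\}$ figure can be checked numerically. The sketch below assumes the bound $R \cdot C \ge \sigma^2 kd/(\lambda^2 n)$ with all constants set to 1, takes $\lambda^2 = \max\{d/n, 1\}$ for the Gaussian design, and plugs in the minimax loss $R = \sigma^2 k/(nm)$; the function name is ours.

```python
def regression_comm_lower_bound(n, m, d, k, sigma2):
    """Communication implied by R*C >= sigma2*k*d/(lam^2*n) at the minimax loss (constants = 1)."""
    lam_sq = max(d / n, 1.0)              # Gaussian design: lambda^2 ~ max{d/n, 1}
    target_loss = sigma2 * k / (n * m)    # statistical minimax rate
    return sigma2 * k * d / (lam_sq * n * target_loss)
```

Both $\sigma^2$ and $k$ cancel, so the result depends only on $m$, $n$ and $d$, which is the point made in the paragraph above.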

5 Direct-sum Theorem for Sparse Parameters 


Unknown parameter: $v \in \{0, 1\}$

Inputs: Machine $j$ gets $n$ samples $X_j = (X_j^{(1)}, \ldots, X_j^{(n)})$, where $X_j$ is distributed according to $\mu_v^n$.

1. All machines publicly sample $k$ independent coordinates $I_1, \ldots, I_k \subset [d]$ (without replacement).

2. Each machine $j$ locally prepares data $\tilde{X}_j = (\tilde{X}_{j,1}, \ldots, \tilde{X}_{j,d})$ as follows: the $I_1$-th coordinate is embedded with the true data, $\tilde{X}_{j,I_1} = X_j$. For $r = 2, \ldots, k$, the $j$-th machine draws $\tilde{X}_{j,I_r}$ privately from the distribution $\mu_1^n$. For any coordinate $i \in [d] \setminus \{I_1, \ldots, I_k\}$, the $j$-th machine draws $\tilde{X}_{j,i}$ privately from the distribution $\mu_0^n$.

3. The machines run protocol $\Pi$ with input data $\tilde{X}$.

4. If $|\hat{\theta}(\Pi)_{I_1}| \ge \delta/2$, then the machines output 1, otherwise they output 0.

Protocol 1: direct-sum reduction for sparse parameter

We prove Theorem 4.4 in this section. Let $\Pi'$ be the protocol described in Protocol 1. Let $\theta \in \mathbb{R}^d$ be such that $\theta_{I_1} = v\delta$, $\theta_{I_r} = \delta$ for $r = 2, \ldots, k$, and $\theta_i = 0$ for $i \in [d] \setminus \{I_1, \ldots, I_k\}$. We can see that by our construction, the distribution of $\tilde{X}_j$ is the same as $\mu_\theta^n$, and all the $\tilde{X}_j$'s are independent. Also note that $\theta$ is $k$-sparse. Therefore when $\Pi'$ invokes $\Pi$ on data $\tilde{X}$, $\Pi$ will have loss $R$ and information cost $I$ with respect to $\tilde{X}$.
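The embedding performed by the reduction can be sketched in a few lines of Python. This is an illustration only: the inner protocol $\Pi$ is replaced by a hypothetical stand-in (a coordinate-wise sample mean, `averaging_estimator`), $\mu_0$ and $\mu_1$ are taken to be $\mathcal{N}(0, \sigma^2)$ and $\mathcal{N}(\delta, \sigma^2)$, and all function names are ours.

```python
import random

def sample_public_coordinates(d, k, rng):
    """Step 1: shared randomness picks I_1, ..., I_k without replacement."""
    return rng.sample(range(d), k)

def prepare_machine_data(x_true, coords, d, delta, sigma, rng):
    """Step 2: embed the real samples at I_1 and fill every other coordinate with fake data."""
    planted = set(coords[1:])  # I_2, ..., I_k receive mean delta
    rows = []
    for sample in x_true:
        row = []
        for i in range(d):
            if i == coords[0]:
                row.append(sample)                    # true data
            elif i in planted:
                row.append(rng.gauss(delta, sigma))   # fake draw from mu_1
            else:
                row.append(rng.gauss(0.0, sigma))     # fake draw from mu_0
        rows.append(row)
    return rows

def averaging_estimator(all_rows, d):
    """Hypothetical stand-in for (Pi, theta_hat): coordinate-wise sample mean."""
    total = [0.0] * d
    for row in all_rows:
        for i, value in enumerate(row):
            total[i] += value
    return [t / len(all_rows) for t in total]

def detect(machine_samples, d, k, delta, sigma, seed=0):
    """Steps 1-4: returns the guessed bit and the public coordinates."""
    public_rng = random.Random(seed)
    coords = sample_public_coordinates(d, k, public_rng)
    all_rows = []
    for j, x_true in enumerate(machine_samples):
        private_rng = random.Random(seed * 1000003 + j + 1)  # private coins of machine j
        all_rows.extend(prepare_machine_data(x_true, coords, d, delta, sigma, private_rng))
    theta_hat = averaging_estimator(all_rows, d)
    guess = 1 if abs(theta_hat[coords[0]]) >= delta / 2 else 0  # step 4 threshold
    return guess, coords
```

The stand-in estimator ignores communication entirely; the sketch is meant to show only how the real samples are hidden among the fake coordinates and how the decision is read off coordinate $I_1$.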

We first verify that the protocol $\Pi'$ does distinguish between $v = 0$ and $v = 1$.

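Proposition 5.1. Under the assumption of Theorem 4.4, when $v = 1$, we have that

$$\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1} - \delta|^2\right] \le \frac{R}{k}, \qquad (21)$$

and when $v = 0$, we have

$$\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1}|^2\right] \le \frac{R}{d-k+1}. \qquad (22)$$

Moreover, with probability at least 3/4, $\Pi'$ outputs the correct answer $v$.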


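Proof. We know that $\Pi$ has mean-squared loss $R$, that is,

$$R((\Pi, \hat{\theta}), \theta) = \mathbb{E}\left[\|\hat{\theta}(\Pi) - \theta\|_2^2\right] = \mathbb{E}\left[\sum_{i=1}^{d} |\hat{\theta}(\Pi)_i - \theta_i|^2\right].$$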

Here the expectation is over the randomness of the protocol $\Pi$ and the randomness of the samples $\tilde{X}_1, \ldots, \tilde{X}_m$. We first prove equation (22), that is,

$$\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1}|^2\right] \le \frac{R}{d-k+1}.$$

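Here the expectation is over $I_1, \ldots, I_k$ in addition to being over the randomness of $\Pi$ and the samples $\tilde{X}_1, \ldots, \tilde{X}_m$. We will in fact prove this claim for any fixing of $I_2, \ldots, I_k$ to some $i_2, \ldots, i_k$. Then $I_1$ is a random coordinate in $[d] \setminus \{i_2, \ldots, i_k\}$, and

$$
\begin{aligned}
\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1}|^2 \mid I_r = i_r, r \ge 2\right]
&= \frac{1}{d-k+1} \sum_{i \in [d] \setminus \{i_2, \ldots, i_k\}} \mathbb{E}\left[|\hat{\theta}(\Pi)_i|^2 \mid I_r = i_r, r \ge 2\right] \\
&\le \frac{1}{d-k+1} \left( \sum_{i \in [d] \setminus \{i_2, \ldots, i_k\}} \mathbb{E}\left[|\hat{\theta}(\Pi)_i|^2 \mid I_r = i_r, r \ge 2\right] + \sum_{i \in \{i_2, \ldots, i_k\}} \mathbb{E}\left[|\hat{\theta}(\Pi)_i - \delta|^2 \mid I_r = i_r, r \ge 2\right] \right).
\end{aligned}
$$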


Taking expectation over $I_2, \ldots, I_k$ we obtain

$$\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1}|^2\right] \le \frac{1}{d-k+1} \mathbb{E}\left[\sum_{i=1}^{d} |\hat{\theta}(\Pi)_i - \theta_i|^2\right] = \frac{1}{d-k+1} R((\Pi, \hat{\theta}), \theta) \le \frac{R}{d-k+1}.$$

In order to prove equation (21), we prove the statement for every fixing of $\{I_1, \ldots, I_k\}$ to some $S \subset [d]$.

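$$
\begin{aligned}
\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1} - \delta|^2 \mid \{I_1, \ldots, I_k\} = S\right]
&= \frac{1}{k} \sum_{i \in S} \mathbb{E}\left[|\hat{\theta}(\Pi)_i - \delta|^2 \mid \{I_1, \ldots, I_k\} = S\right] \\
&\le \frac{1}{k} \left( \sum_{i \in S} \mathbb{E}\left[|\hat{\theta}(\Pi)_i - \delta|^2 \mid \{I_1, \ldots, I_k\} = S\right] + \sum_{i \notin S} \mathbb{E}\left[|\hat{\theta}(\Pi)_i|^2 \mid \{I_1, \ldots, I_k\} = S\right] \right).
\end{aligned}
$$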


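Taking expectation over $I_1, \ldots, I_k$ we obtain

$$\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1} - \delta|^2\right] = \mathbb{E}\left[\mathbb{E}\left[|\hat{\theta}(\Pi)_{I_1} - \delta|^2 \mid \{I_1, \ldots, I_k\} = S\right]\right] \le \frac{1}{k} R((\Pi, \hat{\theta}), \theta) \le \frac{R}{k}.$$

The last statement of the proposition follows easily from Markov's inequality and the assumption that $R \le k\delta^2/16$. □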
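For completeness, the Markov step can be spelled out as follows. This is our own expansion of the one-line remark above, using only (21), (22), the threshold $\delta/2$ from step 4 of Protocol 1, and the assumptions $R \le k\delta^2/16$ and $d \ge 2k$.

```latex
% Case v = 1: the protocol errs only if |\hat\theta(\Pi)_{I_1}| < \delta/2,
% which forces |\hat\theta(\Pi)_{I_1} - \delta| > \delta/2.
\Pr[\text{error} \mid v = 1]
  \le \frac{\mathbb{E}\,|\hat\theta(\Pi)_{I_1} - \delta|^2}{(\delta/2)^2}
  \le \frac{4R}{k\delta^2}
  \le \frac{1}{4}.

% Case v = 0: the protocol errs only if |\hat\theta(\Pi)_{I_1}| \ge \delta/2.
\Pr[\text{error} \mid v = 0]
  \le \frac{\mathbb{E}\,|\hat\theta(\Pi)_{I_1}|^2}{(\delta/2)^2}
  \le \frac{4R}{(d-k+1)\,\delta^2}
  \le \frac{k}{4(d-k+1)}
  \le \frac{1}{4},
% where the last step uses d - k + 1 \ge k + 1 > k, which follows from d \ge 2k.
```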


Now we prove that the information cost of the protocol $\Pi'$ in the case $v = 0$ is small.

Proposition 5.2. Under the assumption of Theorem 4.4, we have

$$\text{min-IC}(\Pi') \le I_0(\Pi'; X_1, \ldots, X_m \mid R'_{\mathrm{pub}}) \le \frac{I}{d-k+1},$$

where $X_j \sim \mu_0^n$ and $R'_{\mathrm{pub}}$ is the public coin used by $\Pi'$.


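Proof. Let us denote $(\tilde{X}_{j,i}^{(1)}, \ldots, \tilde{X}_{j,i}^{(n)})$ by $\tilde{X}_{j,i}$, that is, $\tilde{X}_{j,i}$ is the collection of $i$-th coordinates of the samples on machine $j$. Let $R_{\mathrm{pub}}$ be the public coins used by protocol $\Pi$. Note that $R'_{\mathrm{pub}}$ consists of just $I_1, \ldots, I_k$ and $R_{\mathrm{pub}}$; therefore, the information cost of $\Pi'$ is

$$
\begin{aligned}
I_0(\Pi'; X_1, \ldots, X_m \mid R'_{\mathrm{pub}})
&= I(\Pi; \tilde{X}_{1,I_1}, \ldots, \tilde{X}_{m,I_1} \mid I_1, \ldots, I_k, R_{\mathrm{pub}}) \\
&= \mathbb{E}_{i_2, \ldots, i_k}\left[ I(\Pi; \tilde{X}_{1,I_1}, \ldots, \tilde{X}_{m,I_1} \mid I_1, I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \right]. \qquad (23)
\end{aligned}
$$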


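For each $i_2, \ldots, i_k$, we will prove that $I(\Pi; \tilde{X}_{1,I_1}, \ldots, \tilde{X}_{m,I_1} \mid I_1, I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \le I/(d-k+1)$. Note that conditioned on $I_r = i_r$ for $r \ge 2$, $I_1$ is uniform over $[d] \setminus \{i_2, \ldots, i_k\}$. Hence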

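$$
\begin{aligned}
& I(\Pi; \tilde{X}_{1,I_1}, \ldots, \tilde{X}_{m,I_1} \mid I_1, I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \\
&\quad = \frac{1}{d-k+1} \sum_{i \in [d] \setminus \{i_2, \ldots, i_k\}} I(\Pi; \tilde{X}_{1,i}, \ldots, \tilde{X}_{m,i} \mid I_1 = i, I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \\
&\quad = \frac{1}{d-k+1} \sum_{i \in [d] \setminus \{i_2, \ldots, i_k\}} I(\Pi; \tilde{X}_{1,i}, \ldots, \tilde{X}_{m,i} \mid I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \\
&\quad \le \frac{1}{d-k+1} I\left(\Pi; (\tilde{X}_{1,i}, \ldots, \tilde{X}_{m,i})_{i \in [d] \setminus \{i_2, \ldots, i_k\}} \mid I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}\right) \qquad (24) \\
&\quad \le \frac{1}{d-k+1} I(\Pi; \tilde{X}_1, \ldots, \tilde{X}_m \mid I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}). \qquad (25)
\end{aligned}
$$

The second equality follows from the fact that the distribution of $\tilde{X}_{1,i}, \ldots, \tilde{X}_{m,i}$ for $i \in [d] \setminus \{i_2, \ldots, i_k\}$ does not depend on $i$, and the protocol $\Pi$ is also oblivious of $I_1$; hence we can remove the conditioning on $I_1 = i$. The first inequality follows from Lemma [12] and the fact that $\tilde{X}_{1,i}, \ldots, \tilde{X}_{m,i}$ are independent across $i$. The second inequality follows from the fact that $I(A; B) \le I(A; B, C)$.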

Finally, note that $\Pi$ performs the task $T(n, m, \mathcal{Q})$ with information cost $I = \sup_\theta I_\theta(\Pi; X \mid R_{\mathrm{pub}})$. Note that conditioned on $I_r = i_r$ and $I_1 = i$, $\tilde{X}$ is drawn from some valid $\mu_\theta$ with a $k$-sparse $\theta$. Therefore by the definition of information cost, we have that

$$I(\Pi; \tilde{X}_1, \ldots, \tilde{X}_m \mid I_1 = i, I_2 = i_2, \ldots, I_k = i_k, R_{\mathrm{pub}}) \le I. \qquad (26)$$

Hence it follows from equations (23), (25) and (26) that

$$I_0(\Pi'; X_1, \ldots, X_m \mid R'_{\mathrm{pub}}) \le \frac{I}{d-k+1}, \qquad (27)$$

and it follows by definition that $\text{min-IC}(\Pi') \le \frac{I}{d-k+1}$. □


6 Data Processing Inequality for Truncated Gaussian 

In this section, we prove Theorem 4.1, the SDPI for truncated Gaussian distributions. We first survey the connection between SDPI and transportation inequalities established by Raginsky [Rag14] in Section 6.1. Then we prove in Section 6.2 that when a distribution has a log-concave density function on a finite interval, it satisfies the transportation inequalities. These preparations straightforwardly imply Theorem 4.1, which is proved in Section 6.3.


6.1 SDPI Constant and Transportation Inequality 

Usually in literature, the inequality © is referred to SDPI for mutual information. Here we 
introduce the more common version of strong data processing inequality, which turns out to be 
generally equivalent to SDPI for mutual information. 

Lemma 3. Consider the joint distribution of $(V, X)$ where $V \sim B_{1/2}$ and, conditioned on $V = v$, we have $X \sim \mu_v$. Note that $X$ is distributed according to the distribution $\mu = (\mu_0 + \mu_1)/2$. By Bayes' rule, we can define the reverse channel $K : X \to V$ with transition probabilities $\{K(v \mid x) : v \in \{0, 1\}, x \in \mathbb{R}\}$ equal to the conditional probabilities $P_{V|X}$ of the above joint distribution. For any distribution $\nu$ over $\mathbb{R}$, let $\nu K$ denote the distribution of the output $v$ of $K$ if the input $x$ is distributed according to $\nu$. Then

$$\beta(\mu_0, \mu_1) = \sup_{\nu \ne \mu} \frac{D_{\mathrm{kl}}(\nu K \| \mu K)}{D_{\mathrm{kl}}(\nu \| \mu)}. \qquad (28)$$


Thus, it suffices to bound from above the RHS of (28). We use the technique developed in Theorem 3.7 of [Rag14], which relates the strong data processing inequality to the concentration of measure and specifically to the transportation inequality.

To state the transportation inequality, we define the Wasserstein distance $w_1(\cdot, \cdot)$ between two probability measures.

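Definition 5. The $w_1$ distance between two probability measures $\nu, \mu$ over $\mathbb{R}$ is defined as

$$w_1(\nu, \mu) = \sup_{f : f \text{ is 1-Lipschitz}} \left| \int f \, d\nu - \int f \, d\mu \right|. \qquad (29)$$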


We will prove a simple transportation inequality that relates the cost of transporting $\nu$ to $\mu$ in Wasserstein distance $w_1$ to the KL-divergence between $\nu$ and $\mu$,

$$w_1(\nu, \mu)^2 \le \alpha D_{\mathrm{kl}}(\nu \| \mu), \qquad (30)$$

for a certain value of $\alpha$ in Section 6.2. For a complete survey of transportation inequalities with other cost functions, please see the survey of Gozlan and Léonard [GL10]. However, before proving the transportation inequality, we show how to use it to derive a bound on $\beta(\mu_0, \mu_1)$.


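Lemma 4 (A special case of Theorem 3.7 of [Rag14]). Suppose that for any $v \in \{0, 1\}$, $f_v(x) = \Pr[V = v \mid X = x]$ is $L$-Lipschitz, and that the transportation inequality (30) is true for $\mu = (\mu_0 + \mu_1)/2$ and any measure $\nu$. Then

$$\beta(\mu_0, \mu_1) = \sup_{\nu \ne \mu} \frac{D_{\mathrm{kl}}(\nu K \| \mu K)}{D_{\mathrm{kl}}(\nu \| \mu)} \le \alpha L^2. \qquad (31)$$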


Proof of Lemma 4. We basically follow the proof of Theorem 3.7 of [Rag14] with some simplifications and modifications. Note that $\mu K$ is the unbiased Bernoulli distribution, and by the fact that KL divergence is not greater than the $\chi^2$ distance, we have

$$D_{\mathrm{kl}}(\nu K \| \mu K) \le \chi^2(\nu K \| \mu K) = \sum_{v \in \{0,1\}} \frac{(\nu K(v) - \mu K(v))^2}{\mu K(v)} = 2 \sum_{v \in \{0,1\}} (\nu K(v) - \mu K(v))^2. \qquad (32)$$


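Fixing any $v \in \{0, 1\}$, we have that

$$
\begin{aligned}
|\mu K(v) - \nu K(v)| &= \left| \int \Pr[V = v \mid X = x] \, d\mu - \int \Pr[V = v \mid X = x] \, d\nu \right| \\
&= \left| \int f_v(x) \, d\mu - \int f_v(x) \, d\nu \right| \\
&\le L \, w_1(\nu, \mu), \qquad (33)
\end{aligned}
$$

where the last inequality is by the definition of Wasserstein distance and the fact that $f_v(x)$ is $L$-Lipschitz.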

It follows from (32) and (33) that

$$D_{\mathrm{kl}}(\nu K \| \mu K) \le L^2 w_1^2(\nu, \mu).$$

Then by the transportation inequality (30) we have that

$$D_{\mathrm{kl}}(\nu K \| \mu K) \le L^2 w_1^2(\nu, \mu) \le \alpha L^2 D_{\mathrm{kl}}(\nu \| \mu).$$


□ 


6.2 Proving transportation inequality via concentration of measure 

In this subsection, we show that if $\mu$ is log-concave then it satisfies the transportation inequality (30). To obtain the following theorem, we use a series of tools from the theory of concentration of measure in a straightforward way, although in our setting $\mu$ is supported only on a finite interval and therefore we need to take some additional care.

Theorem 6.1. Suppose $\mu$ is a measure defined on $[a, b]$ with $d\mu = \exp(-u(x))\,dx$ and $\nabla^2 u(x) \ge c$. Then for any measure $\nu$ we have

$$w_1(\nu, \mu)^2 \le \frac{2}{c} \cdot D_{\mathrm{kl}}(\nu \| \mu). \qquad (34)$$

In addition, it can be proved by direct calculation that if both $\mu_0$ and $\mu_1$ are log-concave and $\mu_0$ and $\mu_1$ are not too far apart in some sense, then $\mu = (\mu_0 + \mu_1)/2$ is also log-concave with similar parameters.

Lemma 5. Suppose distributions $\mu_0$ and $\mu_1$ are supported on $[a, b]$ with $d\mu_0 = \exp(-u_0(x))\,dx$ and $d\mu_1 = \exp(-u_1(x))\,dx$. Suppose $\nabla^2 u_0(x) \ge c$, $\nabla^2 u_1(x) \ge c$, and $|\nabla u_0(x) - \nabla u_1(x)| \le \sqrt{2c}$. Then $\mu = \frac{1}{2}(\mu_0 + \mu_1)$ satisfies $d\mu = \exp(-u(x))\,dx$ with $\nabla^2 u(x) \ge \frac{c}{2}$.

To prove Theorem 6.1, we exploit the well-established connections between transportation inequalities, concentration of measure and log-Sobolev inequalities. First of all, the transportation inequality (34) with Wasserstein distance $w_1$ and KL-divergence ties closely to the concentration of the probability measure $\mu$. The theorem of Bobkov-Götze establishes the exact connection:

Theorem 6.2 (Bobkov-Götze [BG99], Theorem 3.1). Let $\mu \in \mathcal{P}_1$ be a probability measure on a metric space $(\mathcal{X}, d)$. Then the following two statements are equivalent for $X \sim \mu$.

1. $w_1(\nu, \mu) \le \sqrt{2\sigma^2 D_{\mathrm{kl}}(\nu \| \mu)}$ for all $\nu$.

2. $f(X)$ is $\sigma^2$-subgaussian for every 1-Lipschitz function $f$.

Using Theorem 6.2, in order to prove Theorem 6.1 it suffices to prove the concentration of measure for $f(X)$ when $X \sim \mu$ and $f$ is 1-Lipschitz. Although one might prove that $f(X)$ is subgaussian directly from the definition, we use the log-Sobolev inequality to get around the tedious calculation. We begin by defining the entropy of a nonnegative random variable.

Definition 6. The entropy of a nonnegative random variable $Z$ is defined as

$$\mathrm{Ent}[Z] := \mathbb{E}[Z \log Z] - \mathbb{E}[Z] \log \mathbb{E}[Z]. \qquad (35)$$

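Entropy is very useful for proving concentration of measure. As illustrated in the following lemma, to prove that $X$ is subgaussian we only need to bound $\mathrm{Ent}[e^{\lambda X}]$ by $\mathbb{E}[e^{\lambda X}]$.

Lemma 6 (Herbst, cf. [Led01]). Suppose that for some random variable $X$, we have

$$\mathrm{Ent}[e^{\lambda X}] \le \frac{\lambda^2 \sigma^2}{2} \mathbb{E}[e^{\lambda X}] \quad \text{for all } \lambda \ge 0. \qquad (36)$$

Then

$$\psi(\lambda) := \log \mathbb{E}[e^{\lambda(X - \mathbb{E}X)}] \le \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \ge 0,$$

and as an immediate consequence, $X$ is a $\sigma^2$-subgaussian random variable.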

Therefore by Theorem 6.2 and Lemma 6, in order to prove the transportation inequality, it suffices to upper bound $\mathrm{Ent}_\mu[e^{\lambda f}]$ by $\mathbb{E}[e^{\lambda f}]$. It turns out that as long as the measure $\mu$ is log-concave, we get the concentration inequality for $f(X)$ with a 1-Lipschitz function $f$.

Theorem 6.3 (Theorem 5.2 of [Led01]). Let $d\mu = e^{-U}dx$ where for some $c > 0$, $\nabla^2 U(x) \ge c$ for all $x \in \mathbb{R}$. Then for every smooth function $f$ on $\mathbb{R}$,

$$\mathrm{Ent}_\mu(f^2) \le \frac{2}{c} \int |\nabla f|^2 \, d\mu.$$

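As a direct corollary, we obtain the inequality (36) that we are interested in.

Corollary 6.4. Let $d\mu = e^{-U}dx$ where for some $c > 0$, $\nabla^2 U(x) \ge c$ for all $x \in \mathbb{R}$. Then for every 1-Lipschitz and smooth function $f$ on $\mathbb{R}$, and any $\lambda \ge 0$, we have

$$\mathrm{Ent}_\mu(e^{\lambda f}) \le \frac{\lambda^2}{2c} \mathbb{E}[e^{\lambda f}].$$

Proof of Corollary 6.4. Applying Theorem 6.3 directly to $e^{\lambda f/2}$ we obtain

$$\mathrm{Ent}_\mu[e^{\lambda f}] \le \frac{2}{c} \int |\nabla e^{\lambda f/2}|^2 \, d\mu = \frac{2}{c} \int |e^{\lambda f/2} \cdot \lambda \nabla f/2|^2 \, d\mu.$$

Note that if $f$ is 1-Lipschitz, we have $|\nabla e^{\lambda f/2}| \le |\tfrac{1}{2} \lambda e^{\lambda f/2}|$, and therefore

$$\mathrm{Ent}_\mu[e^{\lambda f}] \le \frac{\lambda^2}{2c} \int e^{\lambda f} \, d\mu = \frac{\lambda^2}{2c} \mathbb{E}[e^{\lambda f}].$$

□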

The distributions that we are interested in have a continuous density function on a finite support and density 0 elsewhere. Therefore, to be rigorous, we need a non-continuous version of the corollary above.

Corollary 6.5. Let $S = [a, b]$ be a finite interval in $\mathbb{R}$. Let $d\mu = e^{-U}dx$ for $x \in S$ and $d\mu = 0$ for $x \notin S$. Suppose for some $c > 0$, we have $\nabla^2 U(x) \ge c$ for all $x \in S$. Then the conclusion of Corollary 6.4 is still true.

Proof of Corollary 6.5. We first extend Theorem 6.3 to the finite support case. Let $g$ be an extension of $f$ to $\mathbb{R}$ such that $g$ is nonnegative and bounded above by some constant $C$, and $\nabla g$ is also bounded by $C$. Let $U_n$ be a series of extensions of $U$ to $\mathbb{R}$ such that the following holds: a) $U_n$ is twice-differentiable, b) $\nabla^2 U_n(x) \ge c$ for all $x \in \mathbb{R}$, c) $\mu_n = e^{-U_n}dx$ approaches $\mu$ in TV norm as $n$ tends to infinity. (The following choice will work, for example: $U_n(x) = U(x) + \mathbb{1}_{x > b} \cdot (\nabla U(b)(x - b) + \nabla^2 U(b)(x - b)^2 + \exp(n(x - b)^4)) + \mathbb{1}_{x < a} \cdot (\nabla U(a)(x - a) + \nabla^2 U(a)(x - a)^2 + \exp(n(x - a)^4))$.)

Since $g$ and $\nabla g$ are bounded, we have that $|\mathbb{E}_{\mu_n}(g^2) - \mathbb{E}_\mu(g^2)| = \left|\int g^2 (d\mu_n - d\mu)\right| \le C^2 \|\mu_n - \mu\|_{\mathrm{TV}} \to 0$ as $n$ tends to infinity. Similarly we have that $\mathrm{Ent}_{\mu_n}(g^2) \to \mathrm{Ent}_\mu(g^2)$ and $\mathbb{E}_{\mu_n}[|\nabla g|^2] \to \mathbb{E}_\mu[|\nabla g|^2]$. Note that under $\mu$, $g$ agrees with $f$, and therefore $\mathrm{Ent}_{\mu_n}(g^2) \to \mathrm{Ent}_\mu(f^2)$ and $\mathbb{E}_{\mu_n}[|\nabla g|^2] \to \mathbb{E}_\mu[|\nabla f|^2]$.

Also note that $\mu_n$ satisfies the condition of Theorem 6.3, and therefore

$$\mathrm{Ent}_{\mu_n}(g^2) \le \frac{2}{c} \int |\nabla g|^2 \, d\mu_n,$$

and the desired result follows by taking $n$ to infinity. □

Finally we provide the proof of Lemma 5, which is obtained by direct calculation of the second derivatives of $u(x)$.
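One way to carry out this calculation is sketched below. It is our own derivation under the hypotheses of Lemma 5, writing $p(x)$ for the (hypothetical, illustration-only) mixture weight of $\mu_0$ at $x$; it is not taken from the original text.

```latex
% Setup: u = -\log\tfrac12\left(e^{-u_0} + e^{-u_1}\right), and
% p(x) := \frac{e^{-u_0(x)}}{e^{-u_0(x)} + e^{-u_1(x)}} \in [0, 1].
\nabla u = p\,\nabla u_0 + (1 - p)\,\nabla u_1,
\qquad
\nabla p = p(1 - p)\,(\nabla u_1 - \nabla u_0).

% Differentiating once more and substituting \nabla p:
\nabla^2 u
  = p\,\nabla^2 u_0 + (1 - p)\,\nabla^2 u_1 - p(1 - p)\,(\nabla u_0 - \nabla u_1)^2
  \;\ge\; c - \tfrac14 \cdot 2c
  \;=\; \tfrac{c}{2},
% using \nabla^2 u_0, \nabla^2 u_1 \ge c, \; p(1 - p) \le 1/4,
% and |\nabla u_0 - \nabla u_1| \le \sqrt{2c}.
```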

6.3 SDPI for truncated Gaussian 

We first check the Lipschitz constants of $f_v(x) = \Pr[V = v \mid X = x]$ as defined in Lemma 4. The proof of the following lemma is deferred to Section D.3.

Lemma 7. When $X$ is generated by $X \sim \mu_v$ conditioned on $V = v$, let $f_v(x) = \Pr[V = v \mid X = x]$. Then $f_v(x)$ is $\delta/(4\sigma^2)$-Lipschitz for any $v \in \{0, 1\}$.
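The Lipschitz constant can be sanity-checked numerically. The sketch below assumes the two conditional densities on $[-\tau, \tau]$ are proportional to $\gamma_0 e^{-x^2/(2\sigma^2)}$ and $\gamma_1 e^{-(x-\delta)^2/(2\sigma^2)}$ with a uniform prior on $V$, so the posterior is a logistic function of $x$; the normalizers `gamma0`, `gamma1` and the function names are illustrative.

```python
import math

def posterior_v0(x, delta, sigma, gamma0=1.0, gamma1=1.0):
    """Pr[V = 0 | X = x] under a uniform prior; gamma0, gamma1 are truncation normalizers."""
    # log of (gamma1 * e^{-u1(x)}) / (gamma0 * e^{-u0(x)}), which is linear in x with slope delta / sigma^2
    log_ratio = (x * x - (x - delta) ** 2) / (2 * sigma ** 2) + math.log(gamma1 / gamma0)
    return 1.0 / (1.0 + math.exp(log_ratio))

def max_secant_slope(delta, sigma, tau, gamma0=1.0, gamma1=1.0, steps=20000):
    """Largest finite-difference slope of the posterior on a grid over [-tau, tau]."""
    h = 2 * tau / steps
    best = 0.0
    for i in range(steps):
        x = -tau + i * h
        diff = posterior_v0(x + h, delta, sigma, gamma0, gamma1) - posterior_v0(x, delta, sigma, gamma0, gamma1)
        best = max(best, abs(diff) / h)
    return best
```

Because the logistic function has derivative at most 1/4, every secant slope is at most $\delta/(4\sigma^2)$ regardless of the normalizers, and the maximum is nearly attained where the two weighted densities cross.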

We first prove Theorem 4.1 using Lemma 7, Theorem 6.1 and Lemma 4.

Proof of Theorem 4.1. Note that by definition, on the support $[-\tau, \tau]$, $d\mu'_0 = \gamma_0 \exp(-u_0(x))\,dx$ and $d\mu'_1 = \gamma_1 \exp(-u_1(x))\,dx$ with $u_0(x) = \frac{x^2}{2\sigma^2}$ and $u_1(x) = \frac{(x - \delta)^2}{2\sigma^2}$. By Lemma 5, we have that $\mu = (\mu'_0 + \mu'_1)/2$ is $1/\sigma^2$-log-concave, and therefore by Theorem 6.1 we have

$$w_1(\nu, \mu)^2 \le 2\sigma^2 \cdot D_{\mathrm{kl}}(\nu \| \mu).$$

By Lemma 7, the $f_v$'s are $\delta/(4\sigma^2)$-Lipschitz, and therefore by Lemma 4 we have that

$$\beta(\mu'_0, \mu'_1) \le \delta^2/\sigma^2.$$

□


Then we present the proof of Corollary 4.2, which relies on the following observation, whose proof is given in Section D.2.

Lemma 8. Suppose $V \to (X_1, \ldots, X_n) \to \Pi$ forms a Markov chain, where conditioned on $V = v$, $(X_1, \ldots, X_n)$ is distributed according to $\tilde{\mu}_v$. Then $V \to X_1 + \cdots + X_n \to (X_1, \ldots, X_n) \to \Pi$ also forms a Markov chain.

Now we are ready to prove Corollary 4.2.

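Proof (of Corollary 4.2). Let us restate what we want to prove. Suppose $V \sim B_{1/2}$, $(X_1, \ldots, X_n) \mid V = 0 \sim \tilde{\mu}_0$ and $(X_1, \ldots, X_n) \mid V = 1 \sim \tilde{\mu}_1$, and let $V \to (X_1, \ldots, X_n) \to \Pi$ be a Markov chain. Then

$$I(\Pi; V) \le \frac{n\delta^2}{\sigma^2} I(\Pi; X_1, \ldots, X_n).$$

By Lemma 8, $V \to X_1 + \cdots + X_n \to (X_1, \ldots, X_n) \to \Pi$ also forms a Markov chain. Then

$$I(\Pi; V) \le \frac{n\delta^2}{\sigma^2} I(\Pi; X_1 + \cdots + X_n) \le \frac{n\delta^2}{\sigma^2} I(\Pi; X_1, \ldots, X_n),$$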

where the first inequality follows from Theorem 4.1 and the fact that the distribution of $X_1 + \cdots + X_n \mid V = 0$ is the Gaussian $\mathcal{N}(0, n\sigma^2)$ truncated to $[-\tau, \tau]$ and the distribution of $X_1 + \cdots + X_n \mid V = 1$ is the Gaussian $\mathcal{N}(n\delta, n\sigma^2)$ truncated to $[-\tau, \tau]$. The second inequality follows from data processing. □

A Proofs of Results in Section 4

In this section, we prove Theorem 4.3 and Corollary 4.8.

Proof of Corollary 4.8. Suppose there exists such a protocol with mean-squared loss $R$ and communication cost $C$ for the sparse linear regression problem $\mathrm{SLR}(n, m, d, k, \sigma^2)$. We are going to use it to solve the sparse Gaussian mean estimation problem $\mathrm{SGME}(1, m, d, k, \sigma_0^2)$ as follows. Suppose the $i$-th machine has data $X_i \sim \mathcal{N}(\theta, \sigma_0^2 I_{d \times d})$ with $\sigma_0 = \frac{\sigma}{\lambda\sqrt{n}}$. Then the machines can prepare

$$y_{S_i} = A_{S_i} X_i + b_i,$$

where $b_i \sim \mathcal{N}(0, \sigma^2 I - \sigma_0^2 A_{S_i} A_{S_i}^T)$. Note that by the bound $\|A_{S_i}\| \le \lambda\sqrt{n}$, the matrix $\sigma^2 I - \sigma_0^2 A_{S_i} A_{S_i}^T$ is positive semidefinite. Note that $y_{S_i}$ can then be written in the form

$$y_{S_i} = A_{S_i}\theta + \xi_i,$$

where the $\xi_i$'s are independent and distributed according to $\mathcal{N}(0, \sigma^2 I_{n \times n})$.
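The claim about the noise can be checked directly. The following short calculation is our own expansion, writing $X_i = \theta + \sigma_0 g_i$ with $g_i \sim \mathcal{N}(0, I_{d \times d})$ independent of $b_i$, and assuming $\sigma_0 = \sigma/(\lambda\sqrt{n})$ and $\|A_{S_i}\| \le \lambda\sqrt{n}$.

```latex
% Decompose the prepared response into signal plus noise:
y_{S_i} = A_{S_i}\theta + \underbrace{\sigma_0 A_{S_i} g_i + b_i}_{=:\ \xi_i}.

% \xi_i is a sum of independent centered Gaussians, with covariance
\operatorname{Cov}(\xi_i)
  = \sigma_0^2 A_{S_i} A_{S_i}^{T} + \left(\sigma^2 I - \sigma_0^2 A_{S_i} A_{S_i}^{T}\right)
  = \sigma^2 I_{n \times n}.

% The covariance of b_i is a valid (positive semidefinite) matrix because
\sigma_0^2 A_{S_i} A_{S_i}^{T}
  \preceq \sigma_0^2 \|A_{S_i}\|^2 I
  \preceq \frac{\sigma^2}{\lambda^2 n} \cdot \lambda^2 n \cdot I
  = \sigma^2 I.
```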

Then the machines call the protocol for the sparse linear regression problem with data $(y_{S_i}, A_{S_i})$. Therefore we obtain a protocol that solves $\mathrm{SGME}(1, m, d, k, \sigma_0^2)$ with loss $R$ and communication $C$. Then by Theorem 4.5 we know that

$$R \cdot C \ge \Omega(\sigma_0^2 kd) = \Omega\left(\frac{\sigma^2 kd}{\lambda^2 n}\right).$$

□

B Tight Upper Bound with One-way Communication

In this section, we describe a one-way communication protocol achieving the tight minimal communication for the Gaussian mean estimation problem $\mathrm{GME}(n, m, d, \sigma^2)$ under the assumption that $|\theta|_\infty \le \frac{\sigma}{\sqrt{n}}$.

Note that for the design of the protocol, it suffices to consider a one-dimensional problem. Protocol 2 solves the one-dimensional Gaussian mean estimation problem, with each machine sending exactly 1 bit, and therefore the total communication is $m$ bits. To get a $d$-dimensional protocol, we just need to apply Protocol 2 to each dimension. In order to obtain the tradeoff stated in Theorem 4.6, one needs to run Protocol 2 on the first $\alpha m$ machines, and let the other machines be idle.


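Unknown parameter: $\theta \in [-\sigma/\sqrt{n}, \sigma/\sqrt{n}]$

Inputs: Machine $i$ gets $n$ samples $(X_i^{(1)}, \ldots, X_i^{(n)})$, where $X_i^{(j)} \sim \mathcal{N}(\theta, \sigma^2)$.

• Simultaneously, each machine $i$

1. Computes $X_i = \frac{1}{\sigma\sqrt{n}} \sum_{j=1}^{n} X_i^{(j)}$

2. Sends $B_i$, where $B_i = 1$ if $X_i \ge 0$ and $B_i = -1$ otherwise

• Machine 1 computes

$$T = \sqrt{2}\, \mathrm{erf}^{-1}\left(\frac{1}{m} \sum_{i=1}^{m} B_i\right),$$

where $\mathrm{erf}^{-1}$ is the inverse of the Gauss error function.

• It returns the estimate $\hat{\theta} = \frac{\sigma}{\sqrt{n}} \hat{\theta}'$, where $\hat{\theta}' = \max(\min(T, 1), -1)$ is obtained by truncating $T$ to the interval $[-1, 1]$.

Protocol 2: A simultaneous algorithm for estimating the mean of a normal distribution in the distributed setting.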
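A small simulation of the one-bit protocol may help make the steps concrete. The sketch below assumes the coordinator computes $T = \sqrt{2}\,\mathrm{erf}^{-1}(\bar{B})$ with $\bar{B}$ the average of the received bits, and uses the identity $\sqrt{2}\,\mathrm{erf}^{-1}(b) = \Phi^{-1}((1+b)/2)$ so that the standard-library `NormalDist().inv_cdf` can stand in for $\mathrm{erf}^{-1}$; the function name, the seed and the parameter choices are ours.

```python
import math
import random
from statistics import NormalDist

def one_bit_mean_estimate(theta, sigma, n, m, seed=0):
    """Simulate the sign-bit protocol for m machines with n samples each; returns the estimate of theta."""
    rng = random.Random(seed)
    scale = sigma * math.sqrt(n)
    bit_sum = 0
    for _ in range(m):
        # Machine i: normalized sum of its samples, distributed as N(theta*sqrt(n)/sigma, 1)
        x_i = sum(rng.gauss(theta, sigma) for _ in range(n)) / scale
        bit_sum += 1 if x_i >= 0 else -1      # the single transmitted bit
    b_bar = bit_sum / m
    edge = math.erf(1 / math.sqrt(2))         # value of b_bar at which T reaches 1
    if b_bar >= edge:
        t = 1.0                               # truncation of T to [-1, 1]
    elif b_bar <= -edge:
        t = -1.0
    else:
        # sqrt(2) * erfinv(b) equals the standard normal quantile at (1 + b) / 2
        t = NormalDist().inv_cdf((1 + b_bar) / 2)
    return sigma / math.sqrt(n) * t
```

Handling the truncation before calling `inv_cdf` also avoids evaluating the quantile function at 0 or 1, which happens when all machines send the same bit.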


The correctness of the protocol follows from the following theorem. 

Theorem B.1. The algorithm described in Protocol 2 uses $m$ bits of communication and achieves the following mean-squared loss:

$$\mathbb{E}\left[(\hat{\theta} - \theta)^2\right] = O\left(\frac{\sigma^2}{mn}\right),$$

where the expectation is over the random samples and the random coin tosses of the machines.
Proof. Let $\bar{\theta} = \theta\sqrt{n}/\sigma$.

Notice that $X_i$ is distributed according to $\mathcal{N}(\bar{\theta}, 1)$. Our goal is to estimate $\bar{\theta}$ from the $X_i$'s. By our assumption on $\theta$, we have $\bar{\theta} \in [-1, 1]$.

The random variables $B_i$ are independent of each other. We consider the mean and variance of the $B_i$'s. For the mean we have that

$$\mathbb{E}[B_i] = 2 \cdot \Pr[0 \le X_i] - 1.$$

For any $i \in [m]$, $\Pr[0 \le X_i] = \Pr[-X_i \le 0] = \Phi_{-\bar{\theta}, 1}(0)$, where $\Phi_{\mu, \sigma^2}$ is the CDF of the normal distribution $\mathcal{N}(\mu, \sigma^2)$. Note the following relation between the error function and the CDF of a normal random variable:

$$\Phi_{\mu, \sigma^2}(x) = \frac{1}{2} + \frac{1}{2} \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2}\,\sigma}\right).$$

Hence,

$$\mathbb{E}[B_i] = \mathrm{erf}(\bar{\theta}/\sqrt{2}).$$

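Let $B = \frac{1}{m} \sum_{i=1}^{m} B_i$. Then we have that $\mathbb{E}[B] = \mathrm{erf}(\bar{\theta}/\sqrt{2}) \le \mathrm{erf}(1/\sqrt{2})$, and therefore by a Chernoff bound, the probability that $B > \mathrm{erf}(1)$ or $B \le \mathrm{erf}(-1)$ is $\exp(-\Omega(m))$. Thus, with probability at least $1 - \exp(-\Omega(m))$, we have $\mathrm{erf}(-1) < B < \mathrm{erf}(1)$ and therefore $|T| \le \sqrt{2}$. Let $\mathcal{E}$ be the event that $|T| \le \sqrt{2}$. Then the error of $\hat{\theta}'$ is bounded by

$$
\begin{aligned}
\mathbb{E}[|\hat{\theta}' - \bar{\theta}|^2] &= \mathbb{E}[|\hat{\theta}' - \bar{\theta}|^2 \mid \mathcal{E}] \Pr[\mathcal{E}] + \mathbb{E}[|\hat{\theta}' - \bar{\theta}|^2 \mid \bar{\mathcal{E}}] \Pr[\bar{\mathcal{E}}] \\
&\le \mathbb{E}[|\sqrt{2}\,\mathrm{erf}^{-1}(B) - \sqrt{2}\,\mathrm{erf}^{-1}(\mathbb{E}[B])|^2 \mid \mathcal{E}] \Pr[\mathcal{E}] + 2 \Pr[\bar{\mathcal{E}}] \\
&= \mathbb{E}[|\sqrt{2}\,\mathrm{erf}^{-1}(B) - \sqrt{2}\,\mathrm{erf}^{-1}(\mathbb{E}[B])|^2 \mid \mathcal{E}] \Pr[\mathcal{E}] + 2\exp(-\Omega(m)).
\end{aligned}
$$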

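Let $M = \max_{x : \mathrm{erf}^{-1}(x) \in [-1, 1]} \frac{d\,\mathrm{erf}^{-1}(x)}{dx} < 3$. Then we have that $|\mathrm{erf}^{-1}(x) - \mathrm{erf}^{-1}(y)| \le M|x - y| \le O(1) \cdot |x - y|$ for any $x, y$ with $\mathrm{erf}^{-1}(x), \mathrm{erf}^{-1}(y) \in [-1, 1]$. Therefore it follows that

$$
\begin{aligned}
\mathbb{E}[|\hat{\theta}' - \bar{\theta}|^2] &\le \mathbb{E}[|\sqrt{2}\,\mathrm{erf}^{-1}(B) - \sqrt{2}\,\mathrm{erf}^{-1}(\mathbb{E}[B])|^2 \mid \mathcal{E}] \Pr[\mathcal{E}] + 2\exp(-\Omega(m)) \\
&\le \mathbb{E}[2M^2 |B - \mathbb{E}[B]|^2 \mid \mathcal{E}] \Pr[\mathcal{E}] + 2\exp(-\Omega(m)) \\
&\le \mathbb{E}[2M^2 |B - \mathbb{E}[B]|^2] + 2\exp(-\Omega(m)) \\
&\le O\left(\frac{1}{m}\right) + 2\exp(-\Omega(m)).
\end{aligned}
$$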

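Hence we have that

$$\mathbb{E}\left[|\hat{\theta} - \theta|^2\right] = \frac{\sigma^2}{n} \mathbb{E}\left[|\hat{\theta}' - \bar{\theta}|^2\right] = O\left(\frac{\sigma^2}{mn}\right).$$

□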


B.1 Extension to general $\theta$

Now we do not assume that $\theta_\ell \in [-\sigma/\sqrt{n}, \sigma/\sqrt{n}]$ for each dimension $\ell \in [d]$, and still show how to achieve a 1-round protocol with $O(md)$ bits of communication, up to low order terms. We will make the simplifying and standard assumptions, though, that $|\theta_i| \le U = \mathrm{poly}(md)$ for each $i \in [d]$, as well as $\log(mdn/\sigma) = o(m)$ and $mdn/\sigma \ge (mdn)^c$ for a constant $c > 0$.

The protocol. As before, it suffices to consider a one-dimensional problem. Protocol 3 solves the one-dimensional Gaussian mean estimation problem using $O(m + \log^2(mdn/\sigma))$ bits of communication. To solve the $d$-dimensional problem, we run the protocol independently on each coordinate. The total communication will be $O(md + d\log^2(mdn/\sigma))$ bits. We fix $\ell \in [d]$ and let $\theta = \theta_\ell$. Let $\bar\theta = \theta\sqrt n/\sigma$, where now we no longer assume $|\bar\theta| \le 1$. We will show the output $\hat\theta$ satisfies:

$$\mathbb{E}\left[|\hat\theta - \bar\theta|^2\right] = O\!\left(\frac1m\right),$$
Unknown parameter $\theta$

Inputs: Machine $i$ gets $n$ samples $(X_i^{(1)}, \ldots, X_i^{(n)})$ where $X_i^{(j)} \sim \mathcal{N}(\theta, \sigma^2)$.

• Simultaneously, each machine $i$

1. Computes $\bar X_i = \frac{1}{\sigma\sqrt n}\sum_{j=1}^n X_i^{(j)}$

2. If $i \le r = O(\log(mdn/\sigma))$, machine $i$ sends its first $O(\log(mdn/\sigma))$ bits of $\bar X_i$ to the coordinator (Machine 1)

3. Else if $i > r$, machine $i$

(a) Computes $R_i = \bar X_i - \lfloor \bar X_i \rfloor$, $R'_i = \bar X_i + 1/5 - \lfloor \bar X_i + 1/5 \rfloor$

(b) Sends $B_i$ and $B'_i$, where

$B_i = 1$ with probability $R_i$, and $B_i = 0$ with probability $1 - R_i$

$B'_i = 1$ with probability $R'_i$, and $B'_i = 0$ with probability $1 - R'_i$

• Machine 1

1. Computes an estimate $\gamma$, the median of the $\bar X_i$'s sent by the first $r$ machines.

2. Computes

$$T = \frac{1}{m-r}\sum_{i=r+1}^m B_i, \qquad T' = \frac{1}{m-r}\sum_{i=r+1}^m B'_i$$

3. Returns $\frac{\sigma}{\sqrt n}\hat\theta$ where $\hat\theta$ is a multiple of $1/\sqrt{m-r}$ satisfying $|\gamma - \hat\theta| \le 1/100$ and certain agreement conditions with $T$, $T'$ described in the text.


Protocol 3: A simultaneous algorithm for estimating the mean of a normal distribution in the distributed setting without assuming $|\theta| \le \sigma/\sqrt n$.


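from which it follows that the returned value $\frac{\sigma}{\sqrt n}\hat\theta$ has mean squared error $O\!\left(\frac{\sigma^2}{mn}\right)$ as an estimate of $\theta$.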

We now describe the one-dimensional problem for a given unknown mean $\theta$. The first $r = O(\log(mdn/\sigma))$ machines $i$ send the first $O(\log(mdn/\sigma))$ bits of their (averaged) input Gaussians $\bar X_i = \frac{1}{\sigma\sqrt n}\sum_{j=1}^n X_i^{(j)}$ to the coordinator. Note that the random variables $\bar X_i$ are distributed according to $\mathcal{N}(\bar\theta, 1)$.

Since $O(\log(mdn/\sigma))$ bits of each $\bar X_i$ are communicated to the coordinator, since $\bar\theta \le \mathrm{poly}(md)\cdot\sqrt n/\sigma$ (here we use our assumption that $|\theta_\ell| \le \mathrm{poly}(md)$ for each $\ell \in [d]$), and since each $\bar X_i$ has variance 1, it follows by standard Chernoff bounds that the median $\gamma$ of $\bar X_1, \ldots, \bar X_r$ is within an additive $\frac{1}{100}$ of $\bar\theta$ with probability $1 - (mdn/\sigma)^{-\alpha}$ for an arbitrarily large constant $\alpha > 0$ depending on the value $r = O(\log(mdn/\sigma))$. We call this event $\mathcal{E}$, so $\Pr[\mathcal{E}] \ge 1 - \frac{1}{(mdn/\sigma)^\alpha}$.

In parallel, machines $r+1, r+2, \ldots, m$ do the following. Let $R_i \in [0,1)$ be such that $R_i = \bar X_i - \lfloor \bar X_i \rfloor$. Similarly, let $R'_i \in [0,1)$ be such that $R'_i = \bar X_i + 1/5 - \lfloor \bar X_i + 1/5 \rfloor$.

For $i = r+1, \ldots, m$, the $i$-th machine sends a bit $B_i \in \{0,1\}$, where

$$\Pr[B_i = 1] = R_i,$$

and the $i$-th machine also sends a bit $B'_i \in \{0,1\}$ where

$$\Pr[B'_i = 1] = R'_i.$$

We describe the output of the coordinator in the proof of correctness below. Observe that the overall communication is $O(m + \log^2(mdn/\sigma))$, as desired.
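The machine-side step for machines $i > r$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `machine_bits` and its parameters are hypothetical, and the two bits are drawn with independent coins, which the text does not specify.

```python
import math
import random

def machine_bits(samples, sigma, rng, shift=0.2):
    """Return (B_i, B'_i) for one machine with index i > r (illustrative sketch)."""
    n = len(samples)
    x_bar = sum(samples) / (sigma * math.sqrt(n))             # normalized sum, variance 1
    r_plain = x_bar - math.floor(x_bar)                       # R_i in [0, 1)
    r_shift = (x_bar + shift) - math.floor(x_bar + shift)     # R'_i in [0, 1)
    # Randomized rounding: each bit equals 1 with probability equal to the fractional part.
    # Assumption: the two bits use independent randomness.
    b_plain = 1 if rng.random() < r_plain else 0
    b_shift = 1 if rng.random() < r_shift else 0
    return b_plain, b_shift
```

Each such machine sends two bits, so these machines contribute $O(m)$ bits in total, while the first $r$ machines contribute $O(\log^2(mdn/\sigma))$ bits.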


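Correctness. Consider the “sawtooth” wave $f(x)$, which for a parameter $L$ satisfies $f(x) = x/(2L)$ for $x \in [0, 2L)$, and is periodic with period $2L$. Its Fourier series (see footnote 8) is given by

$$f(x) = \frac12 - \frac1\pi\sum_{k=1}^\infty \frac1k \sin\left(\frac{k\pi x}{L}\right).$$

We set $L = 1/2$ and note that $f(\bar X_i) = R_i$. Then, for $X \sim \mathcal{N}(\bar\theta, 1)$, using a standard transformation of the Gaussian distribution,

$$\mathbb{E}[\sin(tX)] = e^{-t^2/2}\sin(t\bar\theta),$$

we have

$$\begin{aligned}
\mathbb{E}[B_i] = \mathbb{E}[R_i] = \mathbb{E}[f(\bar X_i)] &= \frac12 - \frac1\pi\sum_{k=1}^\infty \frac1k e^{-(k\pi/L)^2/2}\sin(k\pi\bar\theta/L) \\
&= \frac12 - \frac1\pi\sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin(2k\pi\bar\theta).
\end{aligned}$$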
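The Gaussian identity $\mathbb{E}[\sin(tX)] = e^{-t^2/2}\sin(t\bar\theta)$ used in this step can be checked numerically. The sketch below compares a trapezoidal approximation of the expectation with the closed form; the function names, the integration window, and the step count are illustrative choices, not part of the paper.

```python
import math

def expected_sin(t, theta, half_width=12.0, steps=24000):
    """Trapezoidal approximation of E[sin(t X)] for X ~ N(theta, 1)."""
    start = theta - half_width
    dx = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        x = start + j * dx
        weight = 0.5 if j in (0, steps) else 1.0
        density = math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)
        total += weight * math.sin(t * x) * density * dx
    return total

def closed_form(t, theta):
    """Right-hand side of the identity: exp(-t^2 / 2) * sin(t * theta)."""
    return math.exp(-t * t / 2.0) * math.sin(t * theta)
```

In the protocol the frequencies are $t = 2k\pi$, so the damping factor is $e^{-2k^2\pi^2}$, which is already below $10^{-8}$ for $k = 1$.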


Let $B = \frac{1}{m-r}\sum_{i=r+1}^m B_i$, so that $\mathbb{E}[B] = \mathbb{E}[B_i]$. Since the $B_i$ are Bernoulli random variables,

$$\mathbb{E}[|B - \mathbb{E}[B]|^2] \le \frac{1}{m-r} \le \frac2m, \qquad (37)$$

where the second inequality uses that $r = O(\log(mdn/\sigma))$ is at most $m/2$ under our assumption that $\log(mdn/\sigma) = o(m)$. In an analogous fashion the coordinator computes a $B'$ using the $B'_i$.

If event $\mathcal{E}$ occurs, then the coordinator knows $\gamma$ satisfying $|\gamma - \bar\theta| \le \frac{1}{100}$, and using $\gamma$ together with $B$, will output its estimate to $\bar\theta$ as follows. Let $\{x\} = x - \lfloor x \rfloor$. The coordinator checks which of the two conditions $\gamma$ satisfies:

1. $1/50 \le \{\gamma\} \le 49/50$ and $|\{\gamma\} - 1/4| \ge 3/100$ and $|\{\gamma\} - 3/4| \ge 3/100$

2. $1/50 \le \{\gamma + 1/5\} \le 49/50$ and $|\{\gamma + 1/5\} - 1/4| \ge 3/100$ and $|\{\gamma + 1/5\} - 3/4| \ge 3/100$.

We note that one of these two conditions must be satisfied. To see this, suppose the first condition is not satisfied. If it is not satisfied because $\{\gamma\} < 1/50$, then $\{\gamma + 1/5\} \in [1/5, 1/5 + 1/50)$, which satisfies the second of the two conditions. If it is not satisfied because $\{\gamma\} > 49/50$, then $\{\gamma + 1/5\} \in (1/5 - 1/50, 1/5)$, which satisfies the second of the two conditions. If the first condition is not satisfied because $\{\gamma\} \in (1/4 - 3/100, 1/4 + 3/100)$, then $\{\gamma + 1/5\} \in (9/20 - 3/100, 9/20 + 3/100)$ and the second condition is satisfied. If the first condition is not satisfied because $\{\gamma\} \in (3/4 - 3/100, 3/4 + 3/100)$, then $\{\gamma + 1/5\} \in (19/20 - 3/100, 19/20 + 3/100)$, which satisfies the second condition.
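The case analysis above can also be confirmed by brute force with exact rational arithmetic. The following sketch checks both conditions on a fine grid of values of $\gamma$; the helper names are illustrative.

```python
from fractions import Fraction

def frac(x):
    """Fractional part {x} = x - floor(x), exact for Fraction inputs."""
    return x - (x.numerator // x.denominator)

def condition_holds(g):
    """The shared test applied to {g}: used with g = gamma and g = gamma + 1/5."""
    f = frac(g)
    return (Fraction(1, 50) <= f <= Fraction(49, 50)
            and abs(f - Fraction(1, 4)) >= Fraction(3, 100)
            and abs(f - Fraction(3, 4)) >= Fraction(3, 100))

def pick_branch(gamma):
    """Return 1 if condition 1 holds, 2 if only condition 2 holds, None otherwise."""
    if condition_holds(gamma):
        return 1
    if condition_holds(gamma + Fraction(1, 5)):
        return 2
    return None
```

A return value of `None` would mean neither condition applies; the argument above shows this never happens.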

If the first condition holds, the coordinator will use $B$ and estimate $\bar\theta$ below, otherwise it will use $B'$ and estimate $\bar\theta + 1/5$ below. We will analyze the first case; the second case is analogous.

8 See, e.g., http://mathworld.wolfram.com/FourierSeriesSawtoothWave.html
Note that since $1/50 \le \{\gamma\} \le 49/50$, and $|\gamma - \bar\theta| \le \frac{1}{100}$, the coordinator learns $Z = \lfloor\bar\theta\rfloor$. Its estimate $\hat\theta$ for $\bar\theta$ is then $Z + g(B)$, for a function $g(B)$ to be specified (in the other case the coordinator would have learned $\lfloor\bar\theta + 1/5\rfloor$ and $\hat\theta$ would have been $\lfloor\bar\theta + 1/5\rfloor + g(B') - 1/5$).

To define $g(B)$, we need the following claim. Note that in the first case $|\{\gamma\} - 1/4| \ge 3/100$ and so by the triangle inequality $|\{\bar\theta\} - 1/4| \ge 3/100 - 1/100 = 1/50$. Similarly, $|\{\bar\theta\} - 3/4| \ge 1/50$, so the conditions of the following claim hold for $\{\bar\theta\}$.

Claim 1. Define $h(x) = \sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin(2k\pi x)$. There exists a constant $C > 0$ with the following guarantee. If $|\{\bar\theta\} - 1/4| \ge 1/50$ and $|\{\bar\theta\} - 3/4| \ge 1/50$ then for any number $x \in [\{\bar\theta\} - 1/100, \{\bar\theta\} + 1/100]$,

$$C \le |h'(x)| \le 1.$$

Before proving the claim, we conclude the correctness proof. The coordinator guesses $\frac{i}{\sqrt m}$ for each integer $i$ for which $|Z + \frac{i}{\sqrt m} - \gamma| \le \frac{1}{100}$. For each guess $\frac{i}{\sqrt m}$, the coordinator checks if

$$\left|\sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin\left(2k\pi\frac{i}{\sqrt m}\right) - \pi\left(\frac12 - B\right)\right| \le \frac{1}{\sqrt m}. \qquad (38)$$

Note that, since the above Fourier series is periodic between successive integers, we need not add $Z$ to $\frac{i}{\sqrt m}$ in (38). Let $g(B)$ be the first guess which passes the check. The coordinator outputs $\hat\theta = Z + g(B)$ as its estimate to $\bar\theta$ (the second case is analogous, in which $Z$ corresponds to $\lfloor\bar\theta + 1/5\rfloor$ and $g(B')$ is defined in the same way). If there is no such $g(B)$ the coordinator just outputs $\gamma$. Note also that if its output ever exceeds our assumed upper bound $U = \mathrm{poly}(mnd/\sigma)$ on the magnitude of $\bar\theta$, then we instead output $U$.

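Then

$$\begin{aligned}
\mathbb{E}[|\hat\theta - \bar\theta|^2] &= \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E}]\Pr[\mathcal{E}] + \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \neg\mathcal{E}]\Pr[\neg\mathcal{E}] \\
&\le \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E}] + 4U^2\cdot\frac{1}{(mdn/\sigma)^\alpha} \\
&\le \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E}] + \frac{4U^2}{(mdn)^{c\alpha}} \\
&\le \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E}] + O\!\left(\frac1m\right), \qquad (39)
\end{aligned}$$

where the first inequality uses $\Pr[\neg\mathcal{E}] \le (mdn/\sigma)^{-\alpha}$ together with $|\hat\theta - \bar\theta| \le 2U$, the second inequality uses our assumption that $(mdn/\sigma) \ge (mdn)^c$ for a constant $c > 0$, and the third inequality holds for a sufficiently large constant $\alpha > 0$ because $U$ is polynomially bounded.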

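Conditioned on $\mathcal{E}$, we have $\hat\theta - \bar\theta = g(B) - \{\bar\theta\}$. If (38) holds for a given $\frac{i}{\sqrt m}$, then

$$\left|\sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin\left(2k\pi\frac{i}{\sqrt m}\right) - \pi\left(\frac12 - B\right)\right| \le \frac{1}{\sqrt m}.$$

Let $\mathcal{F}$ be the event that the coordinator finds such an $\frac{i}{\sqrt m}$ for which (38) holds. We use the shorthand $h(z)$ to denote $\sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin(2k\pi z)$.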


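$$\begin{aligned}
\mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E} \wedge \mathcal{F}] &= \mathbb{E}\left[\left|\tfrac{i}{\sqrt m} - \{\bar\theta\}\right|^2 \,\Big|\, \mathcal{E} \wedge \mathcal{F}\right] \\
&\le \mathbb{E}\left[\left|h\left(\tfrac{i}{\sqrt m}\right) - h(\{\bar\theta\})\right|^2 \,\Big|\, \mathcal{E} \wedge \mathcal{F}\right] \\
&\le \mathbb{E}\left[\left(\left|h\left(\tfrac{i}{\sqrt m}\right) - \pi\left(\tfrac12 - B\right)\right| + \left|\pi\left(\tfrac12 - B\right) - h(\{\bar\theta\})\right|\right)^2 \,\Big|\, \mathcal{E} \wedge \mathcal{F}\right] \\
&\le \mathbb{E}\left[\left(\tfrac{1}{\sqrt m} + \left|\pi\left(\tfrac12 - B\right) - \pi\left(\tfrac12 - \mathbb{E}[B]\right)\right|\right)^2 \,\Big|\, \mathcal{E} \wedge \mathcal{F}\right] \\
&\le \mathbb{E}\left[\left(\tfrac{1}{\sqrt m} + \pi|B - \mathbb{E}[B]|\right)^2 \,\Big|\, \mathcal{E} \wedge \mathcal{F}\right] \\
&\le \frac2m + 2\pi^2\,\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E} \wedge \mathcal{F}],
\end{aligned}$$

where the first equality follows from $\hat\theta - \bar\theta = g(B) - \{\bar\theta\}$, the first inequality uses the fact that the algorithm ensures $|\frac{i}{\sqrt m} - \{\bar\theta\}| \le \frac{1}{100}$, given that $\mathcal{E}$ occurs, and therefore one can apply Claim 1 with $x = \frac{i}{\sqrt m}$ to compare $|h(\frac{i}{\sqrt m}) - h(\{\bar\theta\})|$ with $|\frac{i}{\sqrt m} - \{\bar\theta\}|$ (the lower bound $|h'| \ge C$ gives $|\frac{i}{\sqrt m} - \{\bar\theta\}| \le \frac1C|h(\frac{i}{\sqrt m}) - h(\{\bar\theta\})|$, and the constant factor is suppressed in the display), the second inequality is the triangle inequality, the third inequality uses the guarantee on the value $\frac{i}{\sqrt m}$ chosen by the coordinator and the definition of $\mathbb{E}[B]$, the fourth inequality rearranges terms, and the fifth inequality uses $(a+b)^2 \le 2a^2 + 2b^2$.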

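If there is no value $\frac{i}{\sqrt m}$ for which (38) holds, then since $\mathcal{E}$ occurs it means there is no integer multiple of $\frac{1}{\sqrt m}$, call it $x$, with $|x - \{\bar\theta\}| \le \frac{1}{100}$, for which $|h(x) - \pi(\frac12 - B)| \le \frac{1}{\sqrt m}$. If it were the case that $|\mathbb{E}[B] - B| \le \frac{C}{100\pi}$, where $C > 0$ is the constant of Claim 1, then $|\frac12 - \frac1\pi h(\bar\theta) - B| \le \frac{C}{100\pi}$, or equivalently, $|\pi(\frac12 - B) - h(\bar\theta)| \le \frac{C}{100}$. By Claim 1 though, we can find an $x$ which is an integer multiple of $\frac{1}{\sqrt m}$ which is within $\frac{1}{\sqrt m}$ of $y$, where $h(y) = \pi(\frac12 - B)$. This follows since the derivative on $[\{\bar\theta\} - 1/100, \{\bar\theta\} + 1/100]$ is at least $C$ in absolute value. But then $|h(x) - h(y)| \le |x - y| \le \frac{1}{\sqrt m}$, contradicting that (38) did not hold. It follows that in this case $|\mathbb{E}[B] - B| > \frac{C}{100\pi}$. Now in this case, we obtain an additive $\frac{1}{100}$ approximation, and so $|\hat\theta - \bar\theta|^2 \le \frac{\pi^2}{C^2}|B - \mathbb{E}[B]|^2$. Hence,

$$\mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E} \wedge \neg\mathcal{F}] \le O(1)\cdot\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E} \wedge \neg\mathcal{F}],$$

and so

$$\begin{aligned}
\mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E}] &= \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E} \wedge \mathcal{F}]\Pr[\mathcal{F} \mid \mathcal{E}] + \mathbb{E}[|\hat\theta - \bar\theta|^2 \mid \mathcal{E} \wedge \neg\mathcal{F}]\Pr[\neg\mathcal{F} \mid \mathcal{E}] \\
&\le \left(\frac2m + 2\pi^2\,\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E} \wedge \mathcal{F}]\right)\Pr[\mathcal{F} \mid \mathcal{E}] + O(1)\cdot\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E} \wedge \neg\mathcal{F}]\Pr[\neg\mathcal{F} \mid \mathcal{E}] \\
&\le O\!\left(\frac1m\right) + O(1)\cdot\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E}] \\
&= O\!\left(\frac1m\right),
\end{aligned}$$

where the final inequality uses $\mathbb{E}[|B - \mathbb{E}[B]|^2 \mid \mathcal{E}] \le \frac{\mathbb{E}[|B - \mathbb{E}[B]|^2]}{\Pr[\mathcal{E}]} \le 2\,\mathbb{E}[|B - \mathbb{E}[B]|^2]$, and (37).

Combining this with (39) completes the proof that $\mathbb{E}[|\hat\theta - \bar\theta|^2] = O(1/m)$.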


Proof of Claim. We need to understand the derivative, with respect to $x$, of the function

$$h(x) = \sum_{k=1}^\infty \frac1k e^{-2k^2\pi^2}\sin(2k\pi x),$$

which is equal to

$$h'(x) = \sum_{k=1}^\infty 2\pi e^{-2k^2\pi^2}\cos(2k\pi x).$$

Note that the function is periodic in $x$ with period 1, so we can restrict to $x \in [0,1)$. Consider $z = 2\pi x$. Suppose first that $|z - \pi/2| > \epsilon$ and $|z - 3\pi/2| > \epsilon$ for a constant $\epsilon > 0$ to be determined. Then,

$$|\cos(2\pi x)| \ge \cos(\pi/2 - \epsilon) = \sin(\epsilon) \ge 2\epsilon/\pi,$$

using that $\cos(\pi/2 - \epsilon) = \sin(\epsilon)$ and that $\sin(x)/x \ge 2/\pi$ for $0 < x \le \pi/2$. In this case, it follows that

$$|h'(x)| \ge (2\pi)e^{-2\pi^2}\cdot\frac{2\epsilon}{\pi} - \sum_{k>1} 2\pi e^{-2k^2\pi^2} \ge 4e^{-2\pi^2}\epsilon - 4\pi e^{-8\pi^2},$$

using that the summation is dominated by a geometric series. Note that this expression is at least $4e^{-2\pi^2}(\epsilon - \pi e^{-6\pi^2})$, and so setting $\epsilon = 2\pi e^{-6\pi^2}$ shows that $|h'(x)| = \Omega(1)$. Notice that $x$ satisfies $|2\pi x - \pi/2| > \epsilon$ provided $|x - 1/4| \ge 1/100 > \epsilon/(2\pi)$ and that $x$ satisfies $|2\pi x - 3\pi/2| > \epsilon$ provided that $|x - 3/4| \ge 1/100 > \epsilon/(2\pi)$. As $|\{\bar\theta\} - 1/4| \ge 1/50$ and $|\{\bar\theta\} - 3/4| \ge 1/50$, every $x \in [\{\bar\theta\} - 1/100, \{\bar\theta\} + 1/100]$ satisfies both provisos. Hence, $|h'(x)| = \Omega(1)$ for such $x$, as desired.

On the other hand, it is clear that $|h'(x)| \le 1$, by upper bounding $|\cos(2k\pi x)|$ by 1 and using a geometric series to bound $h'(x)$. □
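The two-sided bound of Claim 1 can be checked numerically. The sketch below evaluates a truncated series for $h'(x)$ on a grid that stays at distance at least $1/100$ from $1/4$ and $3/4$; the truncation level and grid size are illustrative assumptions (terms with $k \ge 2$ are already far below double precision relative to the $k = 1$ term).

```python
import math

def h_prime(x, terms=5):
    """Truncated series for h'(x) = sum_k 2*pi*exp(-2 k^2 pi^2) * cos(2 k pi x)."""
    return sum(2.0 * math.pi * math.exp(-2.0 * (k * math.pi) ** 2) * math.cos(2.0 * k * math.pi * x)
               for k in range(1, terms + 1))

def derivative_range(margin=0.01, steps=10000):
    """Min and max of |h'(x)| over grid points at distance >= margin from 1/4 and 3/4."""
    values = []
    for j in range(steps):
        x = j / steps
        if abs(x - 0.25) >= margin and abs(x - 0.75) >= margin:
            values.append(abs(h_prime(x)))
    return min(values), max(values)
```

On this grid the minimum stays above the lower bound $4\pi e^{-8\pi^2}$ obtained in the proof with $\epsilon = 2\pi e^{-6\pi^2}$, and the maximum is far below 1.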

C Distributed Gap Majority 

Our techniques can also be used to obtain a cleaner proof of the lower bound on the information complexity of distributed gap majority due to Woodruff and Zhang [WZ12]. In this problem, there are $k$ parties/machines and the $i$-th machine receives a bit $Z_i$. The machines communicate via a shared blackboard and their goal is to decide whether $\sum_{i=1}^k Z_i \le k/2 - \sqrt k$ or $\sum_{i=1}^k Z_i \ge k/2 + \sqrt k$. In [WZ12], it was proven that the information complexity of this problem is $\Omega(k)$. We give a different proof using strong data processing inequalities.

The distribution we will consider is the following: let $B \sim B_{1/2}$. Denote $B_{1/2 + 10/\sqrt k}$ by $\mu_1$ and $B_{1/2 - 10/\sqrt k}$ by $\mu_0$. If $B = 1$, sample $Z_1, \ldots, Z_k$ independently according to $\mu_1$. If $B = 0$, sample $Z_1, \ldots, Z_k$ independently according to $\mu_0$.

Theorem C.1. Suppose $\pi$ is a $k$-party protocol (with inputs $Z_1, \ldots, Z_k$) and $\pi$ solves the gap majority problem (up to some error). Then $I(\Pi; Z_1, \ldots, Z_k \mid B = 0) \ge \Omega(k)$.

Here $\Pi$ is the random variable for the transcript of the protocol $\pi$. The intuition for the proof is simple. It is not hard to verify that since $\pi$ solves the gap majority problem, it should be able to estimate $B$ as well, i.e. $I(\Pi; B) \ge \Omega(1)$. However, since each $Z_i$ has only $O(1/k)$ information about $B$, the protocol needs to gather information about $\Omega(k)$ of the $Z_i$'s. It is satisfying that this intuition can indeed be formalized. It is perhaps worth noting that a similar intuition can be drawn for the two-party gap Hamming distance problem, but there we do not have a completely information-theoretic proof of the linear lower bound [CR11]. We will be using the strong data processing inequality for the binary symmetric channel first proven by [AG76]; it studies how information decays on a binary symmetric channel. Suppose $X$ is a bit distributed according to $B_{1/2}$, and $Y$ is another bit obtained from $X$ by passing it through a binary symmetric channel with error $1/2 - \epsilon$ (i.e. $Y$ remains $X$ w.p. $1/2 + \epsilon$ and gets flipped w.p. $1/2 - \epsilon$). Then for any random variable $U$ s.t. $U - X - Y$ is a Markov chain, $I(U; Y) \le 4\epsilon^2 I(U; X)$.
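The strong data processing inequality for the binary symmetric channel can be illustrated on one concrete family of Markov chains. In the sketch below, $U$ is taken (as an assumption made only for this example) to be $X$ passed through a second binary symmetric channel with crossover probability `delta`; then $I(U;X) = 1 - H(\delta)$ and $I(U;Y) = 1 - H(\delta')$ in bits, where $\delta'$ is the crossover probability of the composed channel. The function names are illustrative.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def sdpi_gap(eps, delta):
    """Return 4*eps^2 * I(U;X) - I(U;Y) for uniform X, U = BSC_delta(X), Y = BSC_{1/2-eps}(X)."""
    flip_xy = 0.5 - eps
    # U and Y differ iff exactly one of the two independent channels flips X.
    flip_uy = delta * (1.0 - flip_xy) + (1.0 - delta) * flip_xy
    info_ux = 1.0 - binary_entropy(delta)
    info_uy = 1.0 - binary_entropy(flip_uy)
    return 4.0 * eps ** 2 * info_ux - info_uy
```

The inequality predicts a nonnegative gap for every $\epsilon \in [0, 1/2]$ and every `delta`; in the proof below it is applied with $\epsilon = 10/\sqrt k$, which gives the contraction factor $400/k$.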

Proof. We will denote by $\Pi_{b_1, \ldots, b_k}$ the transcript of the protocol $\pi$ when the inputs to $\pi$ are sampled according to $\mu_{b_1} \otimes \mu_{b_2} \otimes \cdots \otimes \mu_{b_k}$. Since $I(\Pi; B) \ge \Omega(1)$, we know that $h^2(\Pi_{0^k}, \Pi_{1^k}) \ge \Omega(1)$. Now

$$I(\Pi; Z_1, \ldots, Z_k \mid B = 0) \ge \sum_{i=1}^k I(\Pi; Z_i \mid B = 0).$$

Let us denote our distribution of $\Pi, Z_1, \ldots, Z_k, B$ by $p$. We will tweak this distribution a little bit. Take an independent $B' \sim B_{1/2}$. All the variables are distributed the same as in $p$ except $Z_i$, which is taken to be independently distributed as $\mu_{B'}$. Denote the new distribution as $p'$. It is easy to verify that

$$I(\Pi; Z_i \mid B = 0)_p \ge I(\Pi; Z_i \mid B = 0)_{p'}/2.$$

This is true since in $p$, conditioned on $B = 0$, $Z_i$ has the distribution $B_{1/2 - 10/\sqrt k}$ and in $p'$ it is $B_{1/2}$ (and hence use Lemma 11). We can also see that

$$\begin{aligned}
I(\Pi; Z_i \mid B = 0)_{p'} &\ge \Omega(k \cdot I(\Pi; B' \mid B = 0)_{p'}) \\
&\ge \Omega(k \cdot h^2(\Pi_{e_i}, \Pi_{0^k})).
\end{aligned}$$

The first inequality is by the strong data processing inequality for the binary symmetric channel and the second by Lemma 10. Now

$$\begin{aligned}
I(\Pi; Z_1, \ldots, Z_k \mid B = 0) &\ge \sum_{i=1}^k I(\Pi; Z_i \mid B = 0) \\
&\ge \sum_{i=1}^k \Omega(k \cdot h^2(\Pi_{e_i}, \Pi_{0^k})) \\
&\ge \Omega(k \cdot h^2(\Pi_{0^k}, \Pi_{1^k})) \\
&\ge \Omega(k).
\end{aligned}$$

The third inequality is by noting that $\Pi_{b_1, \ldots, b_k}$ satisfies a cut-and-paste property because $\pi$ is a $k$-party protocol and hence Theorem E.1 applies. □


D Missing Proofs in Section 6

D.1 Proof of Lemma 5

Proof of Lemma 5. Let $u(x)$ be such that $d\mu = \exp(-u(x))dx$, that is, $u(x) = -\ln\left(\frac12(\exp(-u_0(x)) + \exp(-u_1(x)))\right)$. We calculate $u''(x)$ as follows.

We can simply calculate the derivatives of $u$. For simplicity of notation, let $h = \exp(-u_0(x)) + \exp(-u_1(x))$. We have that

$$h' = -u_0'\exp(-u_0) - u_1'\exp(-u_1),$$

and

$$h'' = (u_0'^2 - u_0'')\exp(-u_0) + (u_1'^2 - u_1'')\exp(-u_1).$$

Therefore we have

$$u'' = \frac{-hh'' + h'^2}{h^2} = \frac{u_0''\exp(-2u_0) + u_1''\exp(-2u_1) + \left(u_0'' + u_1'' - (u_0' - u_1')^2\right)\exp(-u_0 - u_1)}{\left(\exp(-u_0) + \exp(-u_1)\right)^2}.$$

With some simple algebraic manipulations we have that $u'' \ge t$ (for $t \le \min\{u_0'', u_1''\}$) is equivalent to

$$\left(\sqrt{u_0'' - t}\,\exp(-u_0) - \sqrt{u_1'' - t}\,\exp(-u_1)\right)^2 + \left(\left(\sqrt{u_0'' - t} + \sqrt{u_1'' - t}\right)^2 - (u_0' - u_1')^2\right)\exp(-u_0 - u_1) \ge 0.$$

Therefore, taking $t = \frac c2$ and under our assumptions that $|u_0'(x) - u_1'(x)| \le \sqrt{2c}$ for any $x \in [a, b]$, we have that $u'' \ge \frac c2$ as desired. □
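The closed form for $u''$ can be sanity-checked against a finite-difference second derivative. The sketch below uses the illustrative choice $u_0(x) = x^2/2$ and $u_1(x) = (x - d)^2/2$ (an assumption for this example only), for which $u_0'' = u_1'' = 1$ and $u_0' - u_1' = d$.

```python
import math

def u(x, d):
    """u(x) = -ln((exp(-u0) + exp(-u1)) / 2) with u0 = x^2/2 and u1 = (x - d)^2/2."""
    return -math.log(0.5 * (math.exp(-x * x / 2.0) + math.exp(-(x - d) ** 2 / 2.0)))

def u_second_closed_form(x, d):
    """Closed form for u'' specialized to u0'' = u1'' = 1 and u0' - u1' = d."""
    a = math.exp(-x * x / 2.0)          # exp(-u0)
    b = math.exp(-(x - d) ** 2 / 2.0)   # exp(-u1)
    return (a * a + b * b + (2.0 - d * d) * a * b) / (a + b) ** 2

def u_second_numeric(x, d, step=1e-3):
    """Central finite-difference approximation of u''(x)."""
    return (u(x + step, d) - 2.0 * u(x, d) + u(x - step, d)) / step ** 2
```

With $c = 1$ and $d = 1 \le \sqrt{2c}$, the lemma predicts $u'' \ge 1/2$ everywhere, which the closed form satisfies on the tested grid.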

D.2 Proof of Lemma 8

Let us look at the density of $(X_1, \ldots, X_n)$ conditioned on $X_1 + \cdots + X_n = l \le r$ and $V = v$. Suppose $x_1, \ldots, x_n$ are such that $\sum_i x_i = l$; then for some normalizing constant $C$,

$$\begin{aligned}
p(x_1, \ldots, x_n \mid l, v) &= C\cdot\frac{e^{-(x_1 - v\delta)^2/2\sigma^2}\cdots e^{-(x_n - v\delta)^2/2\sigma^2}}{e^{-(l - nv\delta)^2/2n\sigma^2}} \\
&= C e^{(l - nv\delta)^2/2n\sigma^2 - \sum_i (x_i - v\delta)^2/2\sigma^2} \\
&= C e^{\frac{(l - nv\delta)^2 - n\sum_i (x_i - v\delta)^2}{2n\sigma^2}} \\
&= C e^{\frac{l^2 - n\sum_i x_i^2}{2n\sigma^2}},
\end{aligned}$$

which is independent of $v$ and that proves the lemma. Note that we used the fact that $\sum_i x_i = l$ to simplify the expression.
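The cancellation of $v$ in this computation can be verified numerically. The sketch below evaluates the log of the conditional density, up to an additive constant that does not depend on $v$, and compares it with the simplified exponent; the function names and test values are illustrative.

```python
def log_conditional_density(xs, v, delta, sigma):
    """Log of p(x_1..x_n | sum = l, V = v), up to an additive constant free of v."""
    n = len(xs)
    total = sum(xs)
    log_joint = -sum((x - v * delta) ** 2 for x in xs) / (2.0 * sigma ** 2)
    log_sum_density = -(total - n * v * delta) ** 2 / (2.0 * n * sigma ** 2)
    return log_joint - log_sum_density

def simplified_exponent(xs, sigma):
    """The v-free exponent (l^2 - n * sum x_i^2) / (2 n sigma^2)."""
    n = len(xs)
    total = sum(xs)
    return (total ** 2 - n * sum(x * x for x in xs)) / (2.0 * n * sigma ** 2)
```

Both values of $v$ give the same exponent, which is the sufficiency of the sum $X_1 + \cdots + X_n$ that the lemma relies on.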

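D.3 Proof of Lemma 7

The proof is by direct calculation. Note that by definition on support $[-r, r]$, $d\mu_0' = \gamma_0\exp(-u_0(x))dx$, and $d\mu_1' = \gamma_1\exp(-u_1(x))dx$ with $u_0(x) = \frac{x^2}{2\sigma^2}$ and $u_1(x) = \frac{(x - \delta)^2}{2\sigma^2}$, where $\gamma_0$ and $\gamma_1$ are scaling constants. Note that by the definition of the reverse channel $K$,

$$f_0(x) = \Pr[V = 0 \mid X = x] = \frac{\gamma_0 e^{-\frac{x^2}{2\sigma^2}}}{\gamma_0 e^{-\frac{x^2}{2\sigma^2}} + \gamma_1 e^{-\frac{(x - \delta)^2}{2\sigma^2}}}.$$

Therefore

$$f_0'(x) = -\left(\gamma_0 + \gamma_1\exp\left(\frac{2x\delta - \delta^2}{2\sigma^2}\right)\right)^{-2}\cdot\gamma_0\gamma_1\frac{\delta}{\sigma^2}\exp\left(\frac{2x\delta - \delta^2}{2\sigma^2}\right).$$

By the AM-GM inequality we have

$$|f_0'(x)| \le \left(4\gamma_0\gamma_1\exp\left(\frac{2x\delta - \delta^2}{2\sigma^2}\right)\right)^{-1}\cdot\gamma_0\gamma_1\frac{\delta}{\sigma^2}\exp\left(\frac{2x\delta - \delta^2}{2\sigma^2}\right) = \frac{\delta}{4\sigma^2}.$$

Similarly for $f_1$ we have

$$f_1(x) = \frac{\gamma_1 e^{-\frac{(x - \delta)^2}{2\sigma^2}}}{\gamma_0 e^{-\frac{x^2}{2\sigma^2}} + \gamma_1 e^{-\frac{(x - \delta)^2}{2\sigma^2}}}$$

and

$$f_1'(x) = \left(\gamma_1 + \gamma_0\exp\left(\frac{-2x\delta + \delta^2}{2\sigma^2}\right)\right)^{-2}\cdot\gamma_0\gamma_1\frac{\delta}{\sigma^2}\exp\left(\frac{-2x\delta + \delta^2}{2\sigma^2}\right), \qquad |f_1'(x)| \le \frac{\delta}{4\sigma^2}.$$

Also note that, for $\delta > 0$, $f_0' \le 0$ and $f_1' \ge 0$. Therefore for any $v$, $f_v$ is $\frac{\delta}{4\sigma^2}$-Lipschitz.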
E Toolbox

Lemma 9 (Folklore, Hellinger vs. total variation). For any two distributions $P, Q$, we have

$$h^2(P, Q) \le \|P - Q\|_{TV} \le \sqrt2\,h(P, Q).$$
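Lemma 9 can be checked on small discrete distributions. The sketch below assumes the normalizations $h^2(P,Q) = 1 - \sum_x\sqrt{p(x)q(x)}$ and $\|P - Q\|_{TV} = \frac12\sum_x |p(x) - q(x)|$; the first agrees with the identity $\sum_x(\sqrt{\phi_1(x)} - \sqrt{\phi_2(x)})^2 = 2h^2(\phi_1, \phi_2)$ used in the proof of Lemma 10, while the second is an assumption. Function names are illustrative.

```python
import math

def hellinger_squared(p, q):
    """h^2(P, Q) = 1 - sum_x sqrt(p(x) q(x)) (normalization assumed for this sketch)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def lemma9_holds(p, q, tol=1e-12):
    """Check h^2 <= TV <= sqrt(2) * h, allowing a small floating-point tolerance."""
    h2 = max(hellinger_squared(p, q), 0.0)  # guard against tiny negative rounding
    tv = total_variation(p, q)
    return h2 <= tv + tol and tv <= math.sqrt(2.0 * h2) + tol
```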


Lemma 10. Let $\phi(z_1)$ and $\phi(z_2)$ be two random variables. Let $Z$ denote a random variable with uniform distribution in $\{z_1, z_2\}$. Suppose $\phi(z)$ is independent of $Z$ for each $z \in \{z_1, z_2\}$. Then,

$$2h^2(\phi_{z_1}, \phi_{z_2}) \ge I(Z; \phi(Z)) \ge h^2(\phi_{z_1}, \phi_{z_2}).$$


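Proof. The lower bound of the mutual information follows from Lemma 6.2 of [BJKS04]. For the upper bound, we assume for simplicity that $\phi$ has discrete support $\mathcal{X}$, though the proof extends to continuous random variables directly. Writing $\phi_1 = \phi_{z_1}$ and $\phi_2 = \phi_{z_2}$, we have

$$\begin{aligned}
I(Z; \phi(Z)) &= \frac12 D_{kl}\left(\phi_1 \,\|\, (\phi_1 + \phi_2)/2\right) + \frac12 D_{kl}\left(\phi_2 \,\|\, (\phi_1 + \phi_2)/2\right) \\
&\le \frac12 \chi^2\left(\phi_1 \,\|\, (\phi_1 + \phi_2)/2\right) + \frac12 \chi^2\left(\phi_2 \,\|\, (\phi_1 + \phi_2)/2\right) \\
&= \frac14\sum_{x \in \mathcal{X}} \frac{(\phi_1(x) - \phi_2(x))^2}{\phi_1(x) + \phi_2(x)} + \frac14\sum_{x \in \mathcal{X}} \frac{(\phi_1(x) - \phi_2(x))^2}{\phi_1(x) + \phi_2(x)} \\
&\le \sum_{x \in \mathcal{X}} \frac{(\phi_1(x) - \phi_2(x))^2}{\left(\sqrt{\phi_1(x)} + \sqrt{\phi_2(x)}\right)^2} \\
&= 2h^2(\phi_1, \phi_2),
\end{aligned}$$

where the first inequality uses that KL-divergence is less than $\chi^2$ distance and the second one uses the inequality $a^2 + b^2 \ge \frac{(a+b)^2}{2}$. □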

Theorem E.1 (Corollary of Theorem 7 of [Jay09]). Suppose a family of distributions $\{P_b : b \in \{0,1\}^m\}$ satisfies the cut-paste property: for any $a, b$ and $c, d$ with $\{a_i, b_i\} = \{c_i, d_i\}$ (in a multi-set sense) for every $i \in [m]$, $h^2(P_a, P_b) = h^2(P_c, P_d)$. Then we have

$$\sum_{i=1}^m h^2(P_{\mathbf 0}, P_{e_i}) \ge \Omega(1)\cdot h^2(P_{\mathbf 0}, P_{\mathbf 1}), \qquad (40)$$

where $\mathbf 0$ and $\mathbf 1$ are the all 0's and all 1's vectors respectively, and $e_i$ is the unit vector that only takes 1 in the $i$th entry.

Proof. Theorem 7 of [Jay09] already proves a stronger version of this theorem for the $m = 2^t$ case. Suppose on the other hand $m = 2^t + \ell$ for $\ell < 2^t$. We divide $[m] = \{1, \ldots, m\}$ into a collection of $2^t$ subsets $A_1, \ldots, A_{2^t}$, each of which contains at most 2 elements. Let $f_i$ be the indicator vector of the subset $A_i$. For example, if $A_i = \{p, q\}$, then $f_i = e_p + e_q$. We claim that $\sum_{j \in A_i} h^2(P_{\mathbf 0}, P_{e_j}) \ge \Omega(1)\,h^2(P_{\mathbf 0}, P_{f_i})$. This is trivial when $|A_i| = 1$, and when $A_i = \{p, q\}$, we have that by the Cauchy-Schwarz inequality and the cut-paste property

$$h^2(P_{\mathbf 0}, P_{e_p}) + h^2(P_{\mathbf 0}, P_{e_q}) \ge \frac12 h^2(P_{e_p}, P_{e_q}) = \frac12 h^2(P_{\mathbf 0}, P_{f_i}).$$

Therefore, we can lower bound the LHS as

$$\sum_{i=1}^m h^2(P_{\mathbf 0}, P_{e_i}) \ge \frac12\sum_{i=1}^{2^t} h^2(P_{\mathbf 0}, P_{f_i}).$$

Then applying Theorem 7 of [Jay09] on the RHS of the inequality above we have

$$\frac12\sum_{i=1}^{2^t} h^2(P_{\mathbf 0}, P_{f_i}) \ge \Omega(1)\cdot h^2(P_{\mathbf 0}, P_{\mathbf 1}),$$

and the theorem follows. □

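Lemma 11. Suppose two distributions $\mu, \mu'$ satisfy $\mu \ge c\cdot\mu'$. Let $\Pi(X)$ be a random function that only depends on $X$. If $X \sim \mu$ and $X' \sim \mu'$, then we have that

$$I(X; \Pi(X)) \ge c\cdot I(X'; \Pi(X')). \qquad (41)$$

Proof. Since $\mu \ge c\cdot\mu'$, we have that

$$I(X; \Pi(X)) = \mathbb{E}_{X \sim \mu}\left[D_{kl}(\Pi_X \,\|\, \Pi)\right] \ge c\cdot\mathbb{E}_{X' \sim \mu'}\left[D_{kl}(\Pi_{X'} \,\|\, \Pi)\right].$$

Then note that

$$\mathbb{E}_{X' \sim \mu'}\left[D_{kl}(\Pi_{X'} \,\|\, \Pi)\right] = \mathbb{E}_{X' \sim \mu'}\left[D_{kl}(\Pi_{X'} \,\|\, \Pi')\right] + D_{kl}(\Pi' \,\|\, \Pi).$$

It follows that

$$I(X; \Pi(X)) \ge c\cdot\mathbb{E}_{X' \sim \mu'}\left[D_{kl}(\Pi_{X'} \,\|\, \Pi')\right] = c\cdot I(X'; \Pi(X')).$$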


□ 


Lemma 12 (Folklore). When $X$ is drawn from a product distribution, then

$$\sum_{i=1}^m I(X_i; \Pi) \le I(X; \Pi).$$